I have my preferences when it comes to scripting languages, and so do many of the people I work with. I tend to fall back on good old shell for down-and-dirty stuff. I’ve written a lot of Perl, and Ruby sure seems neat. I’ve struggled with Python’s whitespace. Setting all of that aside, I decided to compare the languages on a handful of common tasks.
I decided I was going to use the following tasks:
- sorting … using different file sizes, since that gets harder with larger sizes
- md5 calculation … It’s a pretty steady computation with linear complexity
- text replacement … because regular expressions rule
- HTTP get request … it’s important in a port 80 based world
For the sorting, I used the words from the English dictionary that comes with aspell on Ubuntu and did some text magic:
strings /var/lib/aspell/en-common.rws | grep -v '*' | sed 's/[[:space:]]*//g' | sort | uniq | tac > words.txt
That produced a list of 163,379 lines, with one “word” per line. Larger files were created by cat’ing the original file 10, 20 and 50 times, resulting in three more data files with 1,633,790, 3,267,580 and 8,168,950 lines respectively.
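The larger inputs are just the original file repeated, so the line counts are simple multiples of 163,379. A minimal Python sketch of that step follows. The function names and file names are hypothetical; the files in this post were built with cat.

```python
import shutil


def repeat_file(src_path, dst_path, copies):
    """Write `copies` back-to-back copies of src_path into dst_path."""
    with open(dst_path, "wb") as dst:
        for _ in range(copies):
            with open(src_path, "rb") as src:
                shutil.copyfileobj(src, dst)


def count_lines(path):
    """Count the lines in a file without loading it all into memory."""
    with open(path, "rb") as f:
        return sum(1 for _ in f)
```

Calling repeat_file("words.txt", "words10.txt", 10) on the 163,379-line file should give a file for which count_lines reports 1,633,790.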
The original “word” file above was also used in the text replacement tests.
For the MD5 calculation, I used the first 2500 lines of the original words list created above.
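To give an idea of what such a test involves, here is a minimal Python sketch. It assumes one digest is computed per word, which is a guess at the setup rather than a description of the scripts that were actually run. The function name is hypothetical.

```python
import hashlib
from itertools import islice


def md5_lines(path, limit=2500):
    """Yield (word, hex digest) for the first `limit` lines of a file.

    Assumption: each line is hashed on its own, without its newline.
    """
    with open(path, "rb") as f:
        for raw in islice(f, limit):
            word = raw.rstrip(b"\n")
            yield word.decode("utf-8", "replace"), hashlib.md5(word).hexdigest()
```

A script built on this would loop over md5_lines("words.txt") and print each pair to standard out.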
The URL get tests were done by taking the original words file and splitting it into 100 files on a localhost webserver and retrieving all of the URLs.
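A rough Python sketch of that kind of fetch loop is below. The base URL and the part000.txt style file names are placeholders, since the post does not say how the 100 files were named, and this is not the script that was timed.

```python
import urllib.request

# Ignore any proxy settings from the environment: the server is on localhost.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def fetch_all(base_url, count=100, pattern="part{:03d}.txt"):
    """GET `count` URLs one after another and return the total bytes read.

    `pattern` is a hypothetical naming scheme for the split files.
    """
    total = 0
    for i in range(count):
        url = base_url + pattern.format(i)
        with _opener.open(url) as resp:
            total += len(resp.read())
    return total
```

For example, fetch_all("http://localhost/words/") would request part000.txt through part099.txt under that (assumed) directory.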
For each test a simple script was created doing the bare minimum of work. Generally this means reading a file, doing something and generating output to standard out. The solutions for the various approaches came mostly from the web to ensure I was taking a reasonable approach without running the risk of stacking the deck against a language due to my ignorance or lack of familiarity.
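As an illustration of that shape (read a file, do something, write to standard out), here is a minimal Python sketch of what a sorting script could look like. It is not the script used in the tests, and the function names are my own.

```python
import sys


def sort_lines(lines):
    """Return the lines in ascending order (plain code point comparison)."""
    return sorted(lines)


def main(path):
    # Read every line, sort in memory, and write the result to standard out.
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        sys.stdout.writelines(sort_lines(f))


if __name__ == "__main__" and len(sys.argv) == 2:
    main(sys.argv[1])
```

One thing to keep in mind when comparing output: Python's sorted compares strings by code point, while GNU sort follows the locale's collation rules unless LC_ALL=C is set, so the two can order the same words differently.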
The tests ran on an old box with a single-core 1.3GHz AMD processor and 512MB of RAM. That put all scripts on equal footing: each one suffered from the same lack of computing power.
To be honest, I was surprised at the results. I ran the tests 3 times and took averages. The data can be viewed at Google Docs Performance of Scripting Languages with the raw results in their own sheets. Here is the summary table, but the Google doc looks crisper.
Times are in seconds with fractions, as produced by time -f "%e".
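The run-three-times-and-average step can also be scripted. The sketch below is a swapped-in approach that measures wall-clock time from Python with time.perf_counter instead of time -f "%e"; the function name is hypothetical and the numbers in this post were not produced with it.

```python
import subprocess
import time


def average_wall_time(cmd, runs=3):
    """Run `cmd` (a list of arguments) `runs` times; return mean seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        # Discard the script's output so terminal speed does not skew timing.
        subprocess.run(cmd, stdout=subprocess.DEVNULL, check=True)
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)
```

Something like average_wall_time(["sort", "words.txt"]) would then give a figure comparable in spirit to the elapsed times reported here, though it also includes the small cost of spawning the process from Python.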
The winner? It was Python which came in twice as fast as the shell script. Perl came in a disappointing third and Ruby was a distant 4th.
What surprised me the most is that neither Perl nor Ruby did very well on the sorting. I set a timeout of 10 minutes on those runs, since on an earlier run I let it go until the box started to swap to death. Perl did OK with the 3.2 million line file, but failed on the 8.1 million line file. Ruby choked on both. For both Ruby and Perl the sort time climbs steeply as the files grow, whereas Python and shell managed to stick much closer to a linear increase in time.
Ruby did very well on the md5sum calculation, beating the other languages by a wide margin.
For text replacement, Perl was the clear winner and Python did not perform well. I believe Python's run time might be improved by using a compiled regular expression, but none of the other tests needed that kind of tuning, so I decided to abstain from language-specific optimization. Fair is fair.
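For reference, this is roughly what the compiled-regex variant would look like in Python. It is a sketch only: the pattern and replacement are left as parameters because the post does not say what was replaced, and the function name is hypothetical.

```python
import re


def replace_all(lines, pattern, replacement):
    """Apply one regex substitution to every line, compiling the pattern once."""
    rx = re.compile(pattern)
    return [rx.sub(replacement, line) for line in lines]
```

Whether this helps much is an open question. Python's re module already caches recently compiled patterns, so the gain over calling re.sub in the loop may be modest.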
The shell script using curl was the slowest at fetching the URLs (a recursive wget might make that much faster), with Python again the winner.
While I might still rage against Python and its whitespace, I do have to tip my hat to the language. It seems that the Python team has done a very nice job performance-wise.
I do a lot in Shell and the results suggest that I could do worse.
As an old Perl monkey I was quite disappointed at Perl’s third place finish. The fact that it choked on sorting was particularly hard to swallow.
I’ve recently started embracing Ruby. The language is attractive and has great features and flexibility, but performance-wise its lack of maturity is sure showing.
I realize that performance isn’t everything. Much can be said for writability, readability, available tools, addons, documentation, etc. In this case I was really just looking at performance for things I do a lot. Python seems to do those things better than the rest.
Update 18 Apr 2012
I’ve been asked about the versions of the scripting languages. They are what’s current with Ubuntu Server 11.10:
- Bash: 4.2.10 and lots of other CLI tools …
- Perl: v5.12.4
- Ruby: 1.8.7
I’m sticking with stock since having to track down other versions generally just makes the scripting more time consuming.