Once again, I found myself needing to sort a bunch of data, but this time from within a Perl script written by a friend, which I was augmenting. The script did some SSH magic to retrieve a lot of data from several servers, building a 100+ MB file for each box. That was the data I needed to sort. Since I was already using backticks for the SSH calls, I considered using them to run sort as well. “But wait!” I thought. “Maybe I should just use Perl’s built-in sort function.” I decided to let performance be my guide.
What I found was that the Perl sort function was almost 50% faster than the standard UNIX/Linux command-line sort. The timings for two of the files looked like this.
The Linux sort ran as follows:
# time cat input-file1 | sort >> outsort-linux
real 0m27.035s
user 0m25.402s
sys 0m1.557s
# time cat input-file2 | sort >> outsort-linux
real 0m57.782s
user 0m53.131s
sys 0m2.509s
While the Perl version generated the following timings:
# time perl -e 'open(FILE, "<input-file1"); @file = <FILE>; @sorted = sort @file; open(OFILE, ">outsort-perl"); print OFILE @sorted;'
real 0m17.281s
user 0m16.064s
sys 0m1.158s
# time perl -e 'open(FILE, "<input-file2"); @file = <FILE>; @sorted = sort @file; open(OFILE, ">outsort-perl"); print OFILE @sorted;'
real 0m31.391s
user 0m28.611s
sys 0m1.776s
That was of course not good enough for me, so I experimented with dropping cat and passing the filename directly to sort as an argument, but that did not change the numbers. I then used sort’s -S option to set the buffer size larger than the file. This produced better results:
# time cat input-file2 | sort -S 170M > outsort-linux
real 0m49.976s
user 0m46.529s
sys 0m2.873s
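For anyone repeating the experiment, the three ways of invoking sort tried above can be checked against each other on a small sample before timing the real files. This is only a sketch: the file names sample.txt and out-*.txt are made up for illustration, and the 1M buffer is simply a value that comfortably exceeds the toy input, in the same spirit as the 170M used above.

```shell
# Build a small unsorted sample (hypothetical file name).
printf '%s\n' pear apple fig banana cherry > sample.txt

# 1. Feed the file through cat, as in the first timings.
cat sample.txt | sort > out-cat.txt

# 2. Pass the file name directly to sort, skipping cat.
sort sample.txt > out-arg.txt

# 3. Ask sort for a main-memory buffer larger than the file with -S.
sort -S 1M sample.txt > out-buf.txt

# All three invocations should write identical output.
cmp out-cat.txt out-arg.txt && cmp out-cat.txt out-buf.txt && echo "outputs match"
```

Once the outputs agree, each command can be prefixed with time, as in the runs shown here, to compare them on the real data.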
But the Perl sort function again won:
# time perl -e 'open(FILE, "<input-file2"); @file = <FILE>; @sorted = sort @file; open(OFILE, ">outsort-perl"); print OFILE @sorted;'
real 0m28.321s
user 0m26.403s
sys 0m1.781s
With all that said, the Perl version will require a lot of memory, since it reads the entire file into memory, while the Linux sort can likely be much more graceful in its memory consumption. In this case, though, I was running on server-sized hardware with plenty of RAM, and shaving 10-20 seconds off something I needed to do several times and in short order made sense.
\\@matthias