Perl out-sorts sort

Once again, I found myself needing to sort a bunch of data, but this time from within a Perl script written by a friend, which I was augmenting. The script did some SSH magic to retrieve a lot of data from several servers, building a 100+ MB file for each box. That was the data I needed to sort. Since I was already using backticks for the SSH calls, I considered using them to sort the data as well. “But wait!” I thought. “Maybe I should just use Perl's built-in sort function.” I decided to let performance be my guide.

What I found was that the Perl sort function took roughly 35 to 45% less wall-clock time than the standard UNIX/Linux command-line sort. The timings for two of the files looked like this.

The Linux sort ran as follows:

# time cat input-file1 | sort >> outsort-linux

real 0m27.035s
user 0m25.402s
sys 0m1.557s

# time cat input-file2 | sort >> outsort-linux

real 0m57.782s
user 0m53.131s
sys 0m2.509s

While the Perl version generated the following timings:

# time perl -e 'open(FILE, "<input-file1"); @file = <FILE>; @sorted = sort @file; open(OFILE, ">outsort-perl"); print OFILE @sorted;'

real 0m17.281s
user 0m16.064s
sys 0m1.158s

# time perl -e 'open(FILE, "<input-file2"); @file = <FILE>; @sorted = sort @file; open(OFILE, ">outsort-perl"); print OFILE @sorted;'

real 0m31.391s
user 0m28.611s
sys 0m1.776s

This was of course not good enough, so I experimented with dropping cat and passing the filename to sort as an argument, but that did not change the numbers. I then used sort's -S option to set the buffer size larger than the file. This produced better results:

# time cat input-file2 | sort -S 170M > outsort-linux

real 0m49.976s
user 0m46.529s
sys 0m2.873s

But the Perl sort function again won:

# time perl -e 'open(FILE, "<input-file2"); @file = <FILE>; @sorted = sort @file; open(OFILE, ">outsort-perl"); print OFILE @sorted;'

real 0m28.321s
user 0m26.403s
sys 0m1.781s
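
One variable these timings leave open is the locale. GNU sort collates according to the current locale (LC_ALL / LC_COLLATE), which is typically slower than plain byte comparison, whereas Perl's default sort compares strings without locale rules. Assuming GNU coreutils and a non-C locale on the test box (neither of which I recorded), a like-for-like check might look like the sketch below. The sample file and its contents are made up for illustration.

```shell
# Build a small sample file (hypothetical data) with mixed-case lines.
sample=$(mktemp)
printf '%s\n' banana Apple cherry apple Banana > "$sample"

# Command-line sort with bytewise collation, forced through LC_ALL=C.
LC_ALL=C sort "$sample" > "$sample.c-sort"

# Perl default sort, which compares strings without locale rules
# unless "use locale" is in effect.
perl -e 'print sort <>' "$sample" > "$sample.perl-sort"

# If the two files match, both tools are doing the same kind of sort.
cmp "$sample.c-sort" "$sample.perl-sort" && echo "same order"
```

If the outputs match, prefixing the LC_ALL=C sort command with time on the real input files would show how much of the gap comes from collation rather than from the sort itself.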

With all that said, the Perl version requires a lot of memory, since it reads the entire file into memory, while the Linux sort can likely be much more graceful in its memory consumption. In this case, though, I was running on server-sized hardware with plenty of RAM, and shaving 10-20 seconds off something I needed to do several times in short order made sense.

