Sunday, March 09, 2008

[TECH] Algorithmic details of UNIX Sort command.

I happened to look at the algorithmic details of UNIX sort; the Linux version of the classic UNIX sort is part of GNU coreutils-6.9.90. It is a classic example of the standard external R-way merge: to sort data of size N bytes with a main memory of size M, it creates N/M initial runs and merges R of them at a time, so the number of passes through the data is log(N/M)/log(R). In fact, the lower bound (runtime) for external sorting is Ω((N/M)·log(N/M)/log(R)). All the external-memory sorting algorithms in the literature are optimal, so the fight here is over minimizing the constant in front of the number of passes.
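To make the pass count concrete, here is a small back-of-the-envelope calculation in Python. The sizes N, M and fan-in R below are hypothetical numbers chosen for illustration, not values taken from the sort source:

```python
import math

N = 16 * 2**30   # input size in bytes (hypothetical: 16 GiB)
M = 1 * 2**30    # main memory size in bytes (hypothetical: 1 GiB)
R = 16           # merge fan-in

runs = math.ceil(N / M)                           # number of initial sorted runs
merge_passes = math.ceil(math.log(runs) / math.log(R))  # log_R(N/M) merge passes

print(runs, merge_passes)  # 16 initial runs, 1 merge pass
```

With 16 runs and a fan-in of 16, a single merge pass suffices; doubling N to 32 GiB would push it to two passes.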

UNIX sort treats keys as lines (strings); the algorithm it follows is indeed the R-way merge. Let the input file size be IN_SIZE.
1. Choosing Run Size:
--------------------------------
The size of the initial runs is chosen from the total physical memory (TOTAL_PHY) and the available memory (AVAIL_PHY): RUN_SIZE = MAX(TOTAL_PHY/8, AVAIL_PHY)/2,
i.e. the maximum of 1/8th of TOTAL_PHY and AVAIL_PHY, divided by 2. See the function "default_sort_size (void)" in the code.
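The rule above can be sketched in a few lines of Python (this is a sketch of the formula as described, not a transcription of the actual C function; the example memory sizes are hypothetical):

```python
def default_sort_size(total_phy, avail_phy):
    """RUN_SIZE = MAX(TOTAL_PHY/8, AVAIL_PHY) / 2, per the rule above."""
    return max(total_phy // 8, avail_phy) // 2

# e.g. 8 GiB of physical memory, 2 GiB currently available (hypothetical):
print(default_sort_size(8 * 2**30, 2 * 2**30))  # -> 2**30, i.e. a 1 GiB run size
```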
2. Creating Runs:
-------------------------
Unix sort creates a temporary file for every run, so it creates IN_SIZE/RUN_SIZE (ceiling) temporary files. Internally it sorts each run with merge sort, using an optimization mentioned in Knuth Volume 3 (2nd edition), exercise 5.2.4-23.
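A minimal sketch of the run-creation step, assuming for simplicity that the run size is measured in lines rather than bytes (the function name and signature are made up for illustration):

```python
import tempfile

def create_runs(lines, run_size):
    """Sort fixed-size chunks of `lines` and write each chunk to its own temp file."""
    runs = []
    for i in range(0, len(lines), run_size):
        chunk = sorted(lines[i:i + run_size])   # in-memory sort of one run
        f = tempfile.TemporaryFile(mode="w+")   # one temporary file per run
        f.writelines(chunk)
        f.seek(0)                               # rewind so the merge phase can read it
        runs.append(f)
    return runs

# 5 lines with a run size of 2 gives ceil(5/2) = 3 temporary run files:
runs = create_runs(["d\n", "b\n", "c\n", "a\n", "e\n"], 2)
print(len(runs))  # 3
```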
3. Merging:
----------------
The number of runs merged at a time is hard-coded in the program; see the macro NMERGE, which is defined to be 16, so it merges exactly 16 runs at a time.
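The merge phase can be sketched with Python's heap-based k-way merge; this operates on in-memory lists rather than the temporary files sort actually uses, and the function name is hypothetical:

```python
import heapq

NMERGE = 16  # merge fan-in, mirroring the NMERGE macro described above

def merge_runs(runs):
    """Repeatedly merge up to NMERGE sorted runs until one sorted run remains."""
    while len(runs) > 1:
        batch, runs = runs[:NMERGE], runs[NMERGE:]
        runs.append(list(heapq.merge(*batch)))  # k-way merge of one batch
    return runs[0]

print(merge_runs([[1, 4], [2, 5], [3, 6]]))  # [1, 2, 3, 4, 5, 6]
```

With at most 16 runs the loop finishes in a single iteration; with more, each iteration replaces 16 runs by one, which is exactly the multi-pass behavior counted by log(N/M)/log(R).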

4 comments:


Jesús said...

I'm wondering if execution time in some scenarios couldn't benefit from using different algorithms. For example, when using a sort command within a long pipe, a selection sort algorithm might be better because, despite being asymptotically worse, it starts outputting results sooner than merge sort, thus reducing the "stall" added by the sort.

The drawback would be memory consumption and perhaps efficiency on very large files (although whether that's a problem would depend on the rest of the pipeline), so it shouldn't be thought of as a general solution. But perhaps a command-line option for using a "low-latency" algorithm, or even an automatic switch when input and output are stdin and stdout, would be handy.