Sunday, March 09, 2008

[TECH] Algorithmic details of UNIX Sort command.

I happened to look at the algorithmic details of UNIX sort; the Linux version of the classic UNIX sort is part of GNU coreutils-6.9.90. It is a classic example of the standard external R-way merge sort: to sort data of size N bytes with a main memory of size M, it creates N/M runs and merges R of them at a time, so the number of passes through the data is log(N/M)/log(R). In fact, the lower bound (on runtime) for external sorting is Ω((N/M)·log(N/M)/log(R)). All the external-memory sorting algorithms in the literature are optimal, so the fight here is over minimizing the constant in front of the number of passes.
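The pass count above is easy to check numerically. Here is a small sketch (my own helper, not part of coreutils) that computes the number of merge passes for given data size, memory size, and merge fan-in:

```python
import math

def merge_passes(n_bytes, mem_bytes, r):
    """Passes over the data needed by an external R-way merge sort.

    With memory M we form ceil(N/M) sorted runs; each merge pass
    reduces the number of runs by a factor of R, so we need
    ceil(log_R(N/M)) passes over the data.
    """
    runs = math.ceil(n_bytes / mem_bytes)
    if runs <= 1:
        return 0  # everything fits in memory: no merge pass at all
    return math.ceil(math.log(runs, r))

# Example: 64 units of data, 1 unit of memory, 16-way merge.
# 64 runs -> pass 1 leaves 4 runs, pass 2 leaves 1 run: 2 passes.
print(merge_passes(64, 1, 16))
```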

UNIX sort treats keys as lines (strings); the algorithm it follows is indeed the R-way merge. Let the input file size be IN_SIZE.
1. Choosing Run Size:
--------------------------------
The size of the initial runs is chosen from the total physical memory (TOTAL_PHY) and the available memory (AVAIL_PHY): RUN_SIZE = MAX(TOTAL_PHY/8, AVAIL_PHY)/2, i.e. the maximum of 1/8th of TOTAL_PHY and AVAIL_PHY, divided by 2. See the function "default_sort_size (void)" in the code.
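That formula translates directly into a one-liner. This is only a sketch of the arithmetic described above, not the actual `default_sort_size` C code, which also clamps against resource limits:

```python
def default_sort_size(total_phy, avail_phy):
    """Run-size choice as described above: take the larger of 1/8 of
    total physical memory and the available memory, then halve it to
    leave headroom for the rest of the process."""
    return max(total_phy // 8, avail_phy) // 2

# Example: 8 GB total, 0.5 GB available -> max(1 GB, 0.5 GB)/2 = 0.5 GB
print(default_sort_size(8 * 2**30, 2**29))
```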
2. Creating Runs:
-------------------------
Unix sort creates a temporary file for every run, so it creates IN_SIZE/RUN_SIZE (ceiling) temporary files. To sort each run internally it uses merge sort, with an optimization mentioned in Knuth volume 3 (2nd edition), problem 5.2.4-23.
3. Merging:
----------------
The number of runs merged at any time is hard-coded in the program; see the macro NMERGE. NMERGE is defined to be 16, so it merges exactly 16 runs at a time.

1 comment:

Jesús said...

I'm wondering if execution time in some scenarios couldn't benefit from using different algorithms. For example, when using a sort command within a long pipe, a selection sort algorithm might be better because, despite being asymptotically worse, it starts outputting results sooner than merge sort, thus reducing the "stall" added by the sort.

The drawback would be memory consumption and perhaps efficiency when using very large files (although whether that's a problem would depend on the rest of the pipeline), so it shouldn't be thought of as a general solution. But perhaps a command-line option for using a "low-latency" algorithm, or even an automatic switch when in and out were stdin and stdout, would be handy.