alpha_cod
alpha_cod

Reputation: 2033

Fastest way to shuffle lines in a file in Linux

I want to shuffle a large file with millions of lines of strings in Linux. I tried 'sort -R' But it is very slow (takes like 50 mins for a 16M big file). Is there a faster utility that I can use in the place of it?

Upvotes: 21

Views: 16353

Answers (3)

Mikhail
Mikhail

Reputation: 1303

You may try my tool: HugeFileProcessor. It's capable of shuffling files of hundreds of GBs in a reasonable time.

Here are the details on shuffling implementation. It requires specifying batchSize - number of lines to keep in RAM when writing to output. The more is the better (unless you are out of RAM), because total shuffling time would be (number of lines in sourceFile) / batchSize * (time to fully read sourceFile). Please note that the program shuffles whole file, not on per-batch basis.

The algorithm is as follows.

  1. Count lines in sourceFile. This is done simply by reading whole file line-by-line. (See some comparisons here.) This also gives a measurement of how much time would it take to read whole file once. So we could estimate how many times it would take to make a complete shuffle because it would require Ceil(linesCount / batchSize) complete file reads.

  2. As we now know the total linesCount, we can create an index array of linesCount size and shuffle it using Fisher–Yates (called orderArray in the code). This would give us an order in which we want to have lines in a shuffled file. Note that this is a global order over the whole file, not per batch or chunk or something.

  3. Now the actual code. We need to get all lines from sourceFile in a order we just computed, but we can't read whole file in memory. So we just split the task.

    • We would go through the sourceFile reading all lines and storing in memory only those lines that would be in first batchSize of the orderArray. When we get all these lines, we could write them into outFile in required order, and it's a batchSize/linesCount of work done.
    • Next we would repeat whole process again and again taking next parts of orderArray and reading sourceFile from start to end for each part. Eventually the whole orderArray is processed and we are done.

Why it works?

Because all we do is just reading the source file from start to end. No seeks forward/backward, and that's what HDDs like. File gets read in chunks according to internal HDD buffers, FS blocks, CPU cahce, etc. and everything is being read sequentially.

Some numbers

On my machine (Core i5, 16GB RAM, Win8.1, HDD Toshiba DT01ACA200 2TB, NTFS) I was able to shuffle a file of 132 GB (84 000 000 lines) in around 5 hours using batchSize of 3 500 000. With batchSize of 2 000 000 it took around 8 hours. Reading speed was around 118000 lines per second.

Upvotes: 0

dshepherd
dshepherd

Reputation: 5407

Use shuf instead of sort -R (man page).

The slowness of sort -R is probably due to it hashing every line. shuf just does a random permutation so it doesn't have that problem.

(This was suggested in a comment but for some reason not written as an answer by anyone)

Upvotes: 36

Paul Miller
Paul Miller

Reputation: 56

The 50 minutes is not caused by the actual mechanics of sorting, based on your description. The time is likely spent waiting on /dev/random to generate enough entropy.

One approach is to use an external source of random data (http://random.org, for example) along with a variation on a Schwartzian Transform. The Schwartzian Transform turns the data to be sorted into "enriched" data with the sort key embedded. The data is sorted using the key and then the key is discarded.

To apply this to your problem:

  • generate a text file with random numbers, 1 per line, with the same number of lines as the file to be sorted. This can be done at any time, run in the background, run on a different server, downloaded from random.org, etc. The point is that this randomness is not generated while you are trying to sort.

  • create an enriched version of the file using paste:

    paste random_number_file.txt string_data.txt > tmp_string_data.txt

  • sort this file:

    sort tmp_string_data.txt > sorted_tmp_string_data.txt

  • remove the random data:

    cut -f2- sorted_tmp_string_data.txt > random_string_data.txt

This is the basic idea. I tried it and it does work, but I don't have 16 million lines of text or 16 million lines of random numbers. You may want to pipeline some of those steps instead of saving it all to disk.

Upvotes: 4

Related Questions