We're All Mad Here

Reputation: 1554

Merge sorted files effectively?

I have n files (50 <= n <= 100) that contain sorted integers. All of the files are the same size, either 250 MB or 500 MB.

e.g.

1st file: 3, 67, 123, 134, 200, ...
2nd file: 1, 12, 33, 37, 94, ...
3rd file: 11, 18, 21, 22, 1000, ...

I am running this on a 4-core machine and the goal is to merge the files as fast as possible.

Since the total size can reach 50 GB, I can't read them all into RAM.

So far I tried to do the following:

1) Read one number from every file and store them in an array.
2) Find the lowest number.
3) Write that number to the output.
4) Read the next number from the file the lowest number came from (if that file is not empty).

Repeat steps 2-4 until there are no numbers left.

Reading and writing are done using 4 MB buffers.
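A simplified sketch of the loop above (assuming one integer per line; the class name, the file handling, and the character-based reader buffers are placeholders standing in for my real 4 MB byte buffers):

    import java.io.*;
    import java.util.*;

    public class NaiveMerge {
        // Simplified sketch of the approach described above.
        // Assumes one integer per line; real buffering not shown.
        public static void merge(List<File> inputs, File output) throws IOException {
            List<BufferedReader> readers = new ArrayList<>();
            for (File f : inputs) {
                readers.add(new BufferedReader(new FileReader(f), 4 * 1024 * 1024));
            }
            long[] current = new long[readers.size()];
            boolean[] exhausted = new boolean[readers.size()];

            // Step 1: read one number from every file.
            for (int i = 0; i < readers.size(); i++) {
                String line = readers.get(i).readLine();
                if (line == null) exhausted[i] = true;
                else current[i] = Long.parseLong(line.trim());
            }

            try (BufferedWriter out = new BufferedWriter(new FileWriter(output), 4 * 1024 * 1024)) {
                while (true) {
                    // Step 2: linear scan for the lowest number.
                    int min = -1;
                    for (int i = 0; i < current.length; i++) {
                        if (!exhausted[i] && (min == -1 || current[i] < current[min])) min = i;
                    }
                    if (min == -1) break; // all files exhausted

                    // Step 3: write that number to the output.
                    out.write(Long.toString(current[min]));
                    out.newLine();

                    // Step 4: refill from the file that supplied the minimum.
                    String line = readers.get(min).readLine();
                    if (line == null) exhausted[min] = true;
                    else current[min] = Long.parseLong(line.trim());
                }
            }
            for (BufferedReader r : readers) r.close();
        }
    }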

My algorithm above works correctly, but it's not performing as fast as I want. The biggest issue is that it performs much worse with 100 files x 250MB than with 50 files x 500MB.

What is the most efficient merge algorithm in my case?

Upvotes: 0

Views: 652

Answers (3)

Jeremy

Reputation: 5321

An effective way to utilize the multiple cores might be to perform input and output in threads distinct from the main comparison thread, in such a way that all the cores are kept busy and the main thread never unnecessarily blocks on input or output: one thread performing the core comparison, one writing the output, and NumCores-2 processing input (each from a subset of the input files) to keep the main thread fed.
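A rough skeleton of that thread layout might look like the following. For simplicity it uses one reader thread per file rather than NumCores-2 readers over subsets, assumes one integer per line, and passes single values through bounded queues where a real version would pass buffered blocks; the queue sizes and the EOF sentinel are made up for illustration.

    import java.io.*;
    import java.nio.file.*;
    import java.util.*;
    import java.util.concurrent.*;

    public class PipelinedMerge {

        // End-of-stream marker; assumed not to occur in the actual data.
        private static final Long EOF = Long.MIN_VALUE;

        public static void merge(List<Path> inputs, Path output) throws Exception {
            List<BlockingQueue<Long>> inQueues = new ArrayList<>();
            BlockingQueue<Long> outQueue = new ArrayBlockingQueue<>(1 << 16);
            ExecutorService pool = Executors.newCachedThreadPool();

            // Input threads: keep the merger fed, bounded queues cap memory use.
            for (Path p : inputs) {
                BlockingQueue<Long> q = new ArrayBlockingQueue<>(1 << 16);
                inQueues.add(q);
                pool.submit(() -> {
                    try (BufferedReader r = Files.newBufferedReader(p)) {
                        String line;
                        while ((line = r.readLine()) != null) q.put(Long.parseLong(line.trim()));
                        q.put(EOF);
                    } catch (Exception e) { throw new RuntimeException(e); }
                    return null;
                });
            }

            // Output thread: drains merged values and writes them out.
            Future<?> writer = pool.submit(() -> {
                try (BufferedWriter w = Files.newBufferedWriter(output)) {
                    Long v;
                    while (!(v = outQueue.take()).equals(EOF)) { w.write(v.toString()); w.newLine(); }
                } catch (Exception e) { throw new RuntimeException(e); }
                return null;
            });

            // Main comparison thread: simple k-way merge over the queue heads.
            int k = inQueues.size();
            Long[] head = new Long[k];
            for (int i = 0; i < k; i++) head[i] = inQueues.get(i).take();
            while (true) {
                int min = -1;
                for (int i = 0; i < k; i++)
                    if (!head[i].equals(EOF) && (min == -1 || head[i] < head[min])) min = i;
                if (min == -1) break;
                outQueue.put(head[min]);
                head[min] = inQueues.get(min).take();
            }
            outQueue.put(EOF);
            writer.get();
            pool.shutdown();
        }
    }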

The input and output threads could also perform stream-specific pre- and post-processing. For example, depending on the distribution of the input data, a run-length encoding scheme of the kind alluded to by @Joop might significantly speed up the main thread by allowing it to efficiently order entire ranges of input.

Naturally all of this increases complexity and the possibility of error.

Upvotes: 1

Joop Eggen

Reputation: 109557

(Java) Use GZIPInputStream and GZIPOutputStream for .gz compression. That may help keep memory and disk usage down to some extent. Use fast rather than high compression.

Disk movement across several files should then be reduced. You could also merge files two at a time, so each merge works with larger sorted sequences.
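A minimal sketch of the stream wrapping, using the standard java.util.zip classes; the 64 KB buffer size is arbitrary, and the subclass is one common workaround because GZIPOutputStream does not expose the compression level in its constructor:

    import java.io.*;
    import java.util.zip.*;

    // GZIPOutputStream has no compression-level parameter, so set the
    // protected Deflater to BEST_SPEED (fast rather than high compression).
    class FastGzipOutputStream extends GZIPOutputStream {
        FastGzipOutputStream(OutputStream out) throws IOException {
            super(out, 64 * 1024);             // 64 KB stream buffer
            def.setLevel(Deflater.BEST_SPEED); // fast compression
        }
    }

    class GzipIo {
        static BufferedReader reader(File f) throws IOException {
            return new BufferedReader(new InputStreamReader(
                    new GZIPInputStream(new FileInputStream(f), 64 * 1024)));
        }
        static BufferedWriter writer(File f) throws IOException {
            return new BufferedWriter(new OutputStreamWriter(
                    new FastGzipOutputStream(new FileOutputStream(f))));
        }
    }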

For repetitions, maybe use run-length encoding: instead of repeating a value, add a repetition count, e.g. 11 12 13#7 15 (seven consecutive 13s).
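A toy encoder/decoder matching that notation ("13#7" meaning seven 13s); the format and separators are made up for illustration:

    import java.util.*;

    class Rle {
        // Encode runs of equal values as value#count.
        static String encode(long[] values) {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < values.length; ) {
                int j = i;
                while (j < values.length && values[j] == values[i]) j++;
                if (sb.length() > 0) sb.append(' ');
                sb.append(values[i]);
                if (j - i > 1) sb.append('#').append(j - i);
                i = j;
            }
            return sb.toString();
        }

        // Expand value#count tokens back into the full sequence.
        static long[] decode(String s) {
            if (s.isEmpty()) return new long[0];
            List<Long> out = new ArrayList<>();
            for (String tok : s.split(" ")) {
                String[] parts = tok.split("#");
                long v = Long.parseLong(parts[0]);
                int count = parts.length > 1 ? Integer.parseInt(parts[1]) : 1;
                for (int c = 0; c < count; c++) out.add(v);
            }
            return out.stream().mapToLong(Long::longValue).toArray();
        }
    }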

Upvotes: 1

amit

Reputation: 178451

Well, you can first significantly improve efficiency by doing step (2) of your algorithm more smartly. Instead of doing a linear search over all the current numbers, use a min-heap: insertion and extraction of the minimum are done in logarithmic time, so it will improve the speed for a large number of files. This changes the time complexity to O(n log k) from the naive O(n*k), where n is the total number of elements and k is the number of files.

In addition, you need to minimize the number of "random" reads from the files, because a few big sequential reads are much faster than many small random reads. You can do that by increasing the buffer size, for example (the same goes for writing).
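A sketch of the heap-based merge, assuming one integer per line; the 8 MB buffers are illustrative stand-ins for "few big sequential reads":

    import java.io.*;
    import java.util.*;

    public class HeapMerge {

        static final class Source {
            final BufferedReader reader;
            long value;
            Source(BufferedReader reader) { this.reader = reader; }
            // Load the next value; close and report false when the file is exhausted.
            boolean advance() throws IOException {
                String line = reader.readLine();
                if (line == null) { reader.close(); return false; }
                value = Long.parseLong(line.trim());
                return true;
            }
        }

        public static void merge(List<File> inputs, File output) throws IOException {
            // Min-heap keyed on each file's current value: extract-min and
            // re-insert are O(log k) instead of a linear scan over k files.
            PriorityQueue<Source> heap =
                    new PriorityQueue<>(Comparator.comparingLong((Source s) -> s.value));
            for (File f : inputs) {
                Source s = new Source(new BufferedReader(new FileReader(f), 8 * 1024 * 1024));
                if (s.advance()) heap.add(s);
            }
            try (BufferedWriter out = new BufferedWriter(new FileWriter(output), 8 * 1024 * 1024)) {
                while (!heap.isEmpty()) {
                    Source s = heap.poll();       // O(log k)
                    out.write(Long.toString(s.value));
                    out.newLine();
                    if (s.advance()) heap.add(s); // O(log k); exhausted files drop out
                }
            }
        }
    }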

Upvotes: 5
