Chris

Reputation: 40641

Is there a more efficient way to reconcile large data sets?

I've been tasked with reconciling two big data sets (two big lists of transactions). Basically, I extract the relevant fields from the two data sources into two files of the same format, then compare the files to find any records that are in A but not in B, or vice versa, and report on them. I wrote a blog entry on my best efforts at achieving this (click if interested).

The gist of it is to load both data sets into one big hash table, with the keys being the rows, and the value being incremented by 1 each time a row appears in file A and decremented by 1 each time it appears in file B. At the end, I look for any key/value pairs where the value != 0.
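
Roughly, the idea looks like this (a Python-flavoured sketch of the approach, not my actual implementation; the file names are placeholders):

    from collections import defaultdict

    def reconcile(path_a, path_b):
        counts = defaultdict(int)
        with open(path_a) as a:
            for line in a:                      # +1 for every occurrence in file A
                counts[line.rstrip("\n")] += 1
        with open(path_b) as b:
            for line in b:                      # -1 for every occurrence in file B
                counts[line.rstrip("\n")] -= 1
        # Anything with a non-zero count appears more times in one file than the other.
        return {row: n for row, n in counts.items() if n != 0}

    for row, n in reconcile("fileA.txt", "fileB.txt").items():
        print(f"{abs(n)} unmatched occurrence(s) in file {'A' if n > 0 else 'B'}: {row}")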

My algorithm seems fast enough (10 seconds for two 100 MB files), but it's a bit memory-intensive: 280 MB to compare two sets of 100 MB files. I'd hope to get it down to 100 MB peak memory usage, and possibly lower if the two data sets are sorted in roughly the same order.

Any ideas?

Also, let me know if this is too open-ended for SO.

Upvotes: 5

Views: 5500

Answers (4)

slimbo

Reputation: 2759

Using this method you would still have to keep the contents of one of the files in memory at all times, though. It would be more efficient, as far as memory goes, to read in only half of the first file, compare it line by line against the second file, then load the second half and do the same. Overlapping the two passes this way ensures that no records are missed, and it eliminates the need to hold an entire file in memory at once.
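
As a rough Python sketch of what I mean (my interpretation of the two-pass idea; exact duplicate rows are glossed over):

    from itertools import islice

    def rows_only_in_b(path_a, path_b):
        # Count the rows of file A so we know where "half" is.
        with open(path_a) as a:
            total_a = sum(1 for _ in a)
        half = (total_a + 1) // 2

        matched = set()                          # line numbers of B that matched some row of A
        for start, stop in ((0, half), (half, total_a)):
            with open(path_a) as a:              # hold only half of A in memory at a time
                chunk = {line.rstrip("\n") for line in islice(a, start, stop)}
            with open(path_b) as b:              # stream B against the current half
                for i, line in enumerate(b):
                    if line.rstrip("\n") in chunk:
                        matched.add(i)

        with open(path_b) as b:                  # B rows that matched neither half of A
            return [line.rstrip("\n") for i, line in enumerate(b) if i not in matched]

    # Run it both ways to get both directions of the comparison.
    only_in_b = rows_only_in_b("fileA.txt", "fileB.txt")
    only_in_a = rows_only_in_b("fileB.txt", "fileA.txt")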

Upvotes: 0

Kevin Nisbet

Reputation: 2249

I have done something similar to this, although only in scripts on unix using shell and perl; however, the theory may carry over.

Step 1: sort both files so they are ordered by the same criteria. I used the unix sort command to do this (I required the unique flag, but you just need some sort of memory-efficient file sort). This is likely the tricky part to figure out on your own.

Step 2: open both files and essentially scan them line by line (or record by record if it's a binary format). If the line in the left file is equal to the one in the right file, the lines match; move on (remember we already sorted the files, so the smallest record should come first).

If the left record is greater than the right record, your right record is missing: add it to your list, read the next line from the right file, and simply do the check again. The same applies the other way around: if your right record is greater, your left record is missing, so report it and keep going.

Scanning the records this way should be very memory-efficient. It may not be as fast, but I was able to crunch several gigs of data, making multiple passes over different fields, within a couple of minutes.
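
In case it helps, here's a rough Python sketch of that scan (it assumes both files have already been sorted, e.g. by the unix sort step above):

    def diff_sorted(path_left, path_right):
        # Merge-scan two files that have already been sorted the same way.
        # Yields ('left', row) for rows only in the left file and
        # ('right', row) for rows only in the right file.
        # Memory use is essentially one line per file.
        with open(path_left) as left, open(path_right) as right:
            l, r = left.readline(), right.readline()
            while l and r:
                if l == r:                    # match: advance both sides
                    l, r = left.readline(), right.readline()
                elif l < r:                   # left is behind: this row has no match on the right
                    yield ("left", l.rstrip("\n"))
                    l = left.readline()
                else:                         # right is behind: this row has no match on the left
                    yield ("right", r.rstrip("\n"))
                    r = right.readline()
            # Whatever is left over in either file has no counterpart in the other.
            while l:
                yield ("left", l.rstrip("\n"))
                l = left.readline()
            while r:
                yield ("right", r.rstrip("\n"))
                r = right.readline()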

Upvotes: 2

Jonathan Rupp

Reputation: 15762

One option may be to change the in-memory format of your data. If your data is a series of numbers stored as text, storing them as integers in memory may lower your memory footprint.
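
For example (the field layout here is invented purely to illustrate the idea):

    # A row like "20090101,000123,99.50" kept as text costs ~25 characters plus
    # string overhead; parsed into a tuple of ints it packs down considerably.
    def parse_row(line):
        date, txn_id, amount = line.rstrip("\n").split(",")
        return (int(date), int(txn_id), int(round(float(amount) * 100)))  # amount in cents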

Another option may be to use some kind of external program to sort the rows -- then you can do a simple in-order scan of the two files looking for differences.

Back to your question though: 280 MB sounds high for comparing a pair of 100 MB files -- you are only loading one into memory (the smaller one) and just scrolling through the other one, right? As you describe it, I don't think you need the full contents of both in memory at once.
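
Something along these lines is what I have in mind (a Python sketch; exact duplicate rows are glossed over, whereas your counting approach handles them):

    def diff_one_sided(small_path, large_path):
        # Load only the smaller file into a set, then stream the larger one past it.
        with open(small_path) as small:
            remaining = {line.rstrip("\n") for line in small}

        only_in_large = []
        with open(large_path) as large:
            for line in large:
                row = line.rstrip("\n")
                if row in remaining:
                    remaining.discard(row)        # matched: not a difference
                else:
                    only_in_large.append(row)     # never appears in the small file

        # Anything still in 'remaining' exists only in the small file.
        return only_in_large, remaining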

Upvotes: 1

mezoid

Reputation: 28720

The only way I can think of is to not load all of the data into memory at once. If you change your processing so that it grabs a bit of each file at a time, that would reduce your memory footprint but increase your disk IO, which would probably result in a longer processing time.

Upvotes: 1
