Reputation: 1748
There are two integer arrays ,each in very large files (size of each is larger than RAM). How would you find the common elements in the arrays in linear time.
I cant find a decent solution to this problem. Any ideas?
Upvotes: 10
Views: 3158
Reputation: 519
You can obviously use an hash table to find common elements with O(n) time complexity.
First, you need to create an hash table using the first array, then compare the second array using this hash table.
Upvotes: 4
Reputation: 11
Let's say enough RAM is available to hold 5% of hash of either given file-array (FA).
So, I can split the file arrays (FA1 and FA2) into 20 chunks each - say do a MOD 20 of the contents. We get FA1(0)....FA1(19) and FA2(0)......FA2(19). This can be done in linear time.
Hash FA1(0) in memory and compare contents of FA2(0) with this hash. Hashing and checking for existence are constant time operations.
Destroy this hash and repeat for FA1(1)...FA1(19). This is also linear. So, the whole operation is linear.
Upvotes: 1
Reputation: 3928
Assuming integer size is 4 bytes. Now we can have maximum of 2^32 integers i.e I can have a bitvector of 2^32 bits (512 MB) to represent all integers where each bit reperesents 1 integer. 1. Initialize this vector with all zeroes 2. Now go through one file and set bits in this vector to 1 if you find an integer. 3. Now go through other file and look for any set bit in bit Vector.
Time complexity O(n+m) space complexity 512 MB
Upvotes: 6
Reputation: 4364
Sort files. With fixed length integers it can be done in O(n) time:
On sorted files find a common subset of integers: compare numbers, write it down if they are equal, then step one number ahead on file with smaller number. This is O(n).
All operations are O(n) and final algorithm is O(n) too.
EDIT: bitmap method is much faster if you have enough memory for bitmaps. This method works for any fixed size integers, 64-bit for example. Bitmap of size 2^31 Mb will not be practical for at least a few years :)
Upvotes: -1
Reputation: 2305
Assuming you are talking of integers with the same size, and written in the files in binary mode, you first sort the 2 files (use a quicksort, but reading and writing to the file "offsets" ). Then you just need to move from the start of the 2 files, and check for matches, if you have a match write the output to another file (assuming you can't also store the result in memory) and keep moving on the files until EOF.
Upvotes: -1
Reputation: 52284
One pass on one file build a bitmap (or a Bloom filter if the integer range is too large for a bitmap in memory).
One pass on the other file find the duplicates (or candidates if using a Bloom filter).
If you use a Bloom filter, the result is probabilistic. New passes can reduce the false positive (Bloom filters don't have false negative).
Upvotes: 12