Kshitij Banerjee

Reputation: 1748

Find common elements from two very large Arrays

There are two integer arrays, each in a very large file (the size of each is larger than RAM). How would you find the common elements in the arrays in linear time?

I can't find a decent solution to this problem. Any ideas?

Upvotes: 10

Views: 3158

Answers (6)

Onur Senture

Reputation: 519

You can obviously use a hash table to find the common elements in O(n) expected time.

First, build a hash table from the first array, then check each element of the second array against this hash table.
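A minimal sketch of this approach (illustrative; plain Python lists stand in here for the arrays, even though the question involves files too large for RAM):

```python
def common_elements(a, b):
    seen = set(a)            # build a hash table from the first array: O(n)
    common = set()
    for x in b:              # O(m) lookups, O(1) expected each
        if x in seen:
            common.add(x)    # emit each common value once
    return common

print(sorted(common_elements([1, 5, 7, 9], [5, 9, 11])))  # → [5, 9]
```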

Upvotes: 4

Ganesh

Reputation: 11

Let's say enough RAM is available to hold a hash table of 5% of either given file-array (FA).

So, I can split the file arrays (FA1 and FA2) into 20 chunks each, say by partitioning on the value MOD 20. We get FA1(0)...FA1(19) and FA2(0)...FA2(19). This can be done in linear time.

Hash FA1(0) in memory and compare contents of FA2(0) with this hash. Hashing and checking for existence are constant time operations.

Destroy this hash and repeat for FA1(1)...FA1(19). This is also linear. So, the whole operation is linear.
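The steps above can be sketched as follows (an in-memory model, where Python lists stand in for the file-arrays and `k = 20` partitions are assumed):

```python
def common_by_partition(fa1, fa2, k=20):
    # Phase 1: split each array into k buckets by value mod k (linear).
    buckets1 = [[] for _ in range(k)]
    buckets2 = [[] for _ in range(k)]
    for x in fa1:
        buckets1[x % k].append(x)
    for x in fa2:
        buckets2[x % k].append(x)
    # Phase 2: for each bucket pair, hash one side and probe the other;
    # the hash is discarded before the next bucket, bounding memory use.
    result = set()
    for i in range(k):
        in_mem = set(buckets1[i])
        result.update(x for x in buckets2[i] if x in in_mem)
    return result
```

Only elements with equal residues mod 20 can be equal, so comparing bucket i of FA1 against bucket i of FA2 finds every common element.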

Upvotes: 1

CommonMan

Reputation: 3928

Assuming integer size is 4 bytes, there are at most 2^32 distinct integers, so a bit vector of 2^32 bits (512 MB) can represent all integers, with each bit representing one value.

  1. Initialize this vector with all zeroes.
  2. Go through one file and set the bit for each integer you find.
  3. Go through the other file and report each integer whose bit is already set.

Time complexity: O(n+m). Space: 512 MB.
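A sketch of this bit-vector technique (the `universe` parameter stands in for the 2^32-value range; the full 512 MB vector works the same way but is too large for a quick demo):

```python
def common_with_bitvector(file1_ints, file2_ints, universe=2**32):
    bits = bytearray((universe + 7) // 8)   # step 1: all bits zero
    for x in file1_ints:                    # step 2: set a bit per value
        bits[x >> 3] |= 1 << (x & 7)
    # step 3: report values from the second file whose bit is set
    return [x for x in file2_ints if bits[x >> 3] & (1 << (x & 7))]
```

This assumes the integers are non-negative; signed 32-bit values can be mapped into 0..2^32-1 by adding an offset first.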

Upvotes: 6

blaze

Reputation: 4364

Sort the files. With fixed-length integers this can be done in O(n) time:

  1. Read some part of a file, sort it with radix sort, and write it to a temporary file. Repeat until all data is processed. This part is O(n).
  2. Merge the sorted parts. This is O(n) too. You can even skip repeated numbers.

On the sorted files, find the common subset of integers: compare the current numbers, write one down if they are equal, then step one number ahead in the file with the smaller number. This is O(n).

All operations are O(n) and final algorithm is O(n) too.
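The final merge-intersection step can be sketched like this (the sorted files are modeled as iterators of integers, e.g. as produced by the external sort above):

```python
def sorted_intersection(it1, it2):
    out = []
    a, b = iter(it1), iter(it2)
    x = next(a, None)
    y = next(b, None)
    while x is not None and y is not None:
        if x == y:
            out.append(x)                       # match: record it
            x, y = next(a, None), next(b, None) # advance both streams
        elif x < y:
            x = next(a, None)   # step ahead in the stream with the smaller value
        else:
            y = next(b, None)
    return out
```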

EDIT: the bitmap method is much faster if you have enough memory for the bitmap. Sorting works for fixed-size integers of any width, 64-bit for example; a bitmap covering all 64-bit integers would need 2^64 bits, which will not be practical for at least a few years :)

Upvotes: -1

João Augusto

Reputation: 2305

Assuming you are talking about integers of the same size, written to the files in binary mode, first sort the 2 files (use a quicksort, but one that reads and writes at file offsets instead of in memory). Then move from the start of the 2 files, checking for matches; when you have a match, write it to another output file (assuming you also can't store the result in memory) and keep moving through the files until EOF.

Upvotes: -1

AProgrammer

Reputation: 52284

In one pass over one file, build a bitmap (or a Bloom filter if the integer range is too large for a bitmap in memory).

In one pass over the other file, find the duplicates (or candidates, if using a Bloom filter).

If you use a Bloom filter, the result is probabilistic. Additional passes can reduce the false positives (Bloom filters have no false negatives).
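A minimal Bloom-filter sketch of this idea (the sizes `m_bits` and `k` here are illustrative placeholders; a real filter would derive them from the expected element count and the target false-positive rate):

```python
import hashlib

class BloomFilter:
    def __init__(self, m_bits=1 << 20, k=4):
        self.m = m_bits
        self.k = k
        self.bits = bytearray(m_bits // 8)

    def _positions(self, value):
        # derive k bit positions from one SHA-256 digest of the value
        digest = hashlib.sha256(str(value).encode()).digest()
        for i in range(self.k):
            chunk = int.from_bytes(digest[i * 4:(i + 1) * 4], "big")
            yield chunk % self.m

    def add(self, value):
        for p in self._positions(value):
            self.bits[p >> 3] |= 1 << (p & 7)

    def might_contain(self, value):
        # False means definitely absent; True means *possibly* present
        return all(self.bits[p >> 3] & (1 << (p & 7))
                   for p in self._positions(value))
```

In one pass over the first file, `add` every integer; in a pass over the second, keep only the values for which `might_contain` returns True; a further pass over the first file can then weed out the false positives among those candidates.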

Upvotes: 12
