vmr
vmr

Reputation: 1935

Given two files find words in file1 but not in file2

Given two files come up with an algorithm/program to find words in file1, but not in file2.
Note that the words in file are not in sorted order.

This is my thought process:

This algorithm will work fine if the number of words in both the files is few 100s or 1000s.
However, if both the files are huge (billions of words), then this solution will not work, so I came up with an improved solution:

So the map will look like this ['a':{'ample','and'...}]. This will help me in searching for bucket within log(n) time complexity and then log(n) to find if the word is contained within the sorted list.

This solution will work, but i believe there is still scope for improvement.
How can this solution be improved further?

Upvotes: 1

Views: 271

Answers (1)

amit
amit

Reputation: 178481

A possible solution is to use some external sort to sort both files, and then iterate them in parallel to find a word that appear only in file 1:

Pseudo code (after sorting):

iter1 = 0
iter2 = 0
while iter1 < file1.length:
    if file1[iter1] == file2[iter2]:
         iter1 = iter1 + 1
         iter2 = iter2 + 1
    else if file1[iter1] > file2[iter2]:
         iter2 = iter2 + 1
    else: //we know for sure the item is only in file1
         iter1 = iter1 + 1
         yield file1[iter1]

This solution takes O(nlogn) time, and very little space needed (the amount of space needed for the external sort).

Also note, this problem is a variant of element distinctness problem, so most likely it has lower bound of Omega(nl0gn) when using comparisons based algorithm, or Omega(n) time + space for using hashing.

Upvotes: 2

Related Questions