Reputation: 5225

Find common words from two files

Given two files containing list of words(around million), We need to find out the words that are in common.

Use Some efficient algorithm, also not enough memory availble(1 million, certainly not).. Some basic C Programming code, if possible, would help.

The files are not sorted.. We can use some sort of algorithm... Please support it with basic code...

Sorting the external file...... with minimum memory available,, how can it be implement with C programming.

Anybody game for external sorting of a file... Please share some code for this.

Upvotes: 1

Answers (5)

davka

Reputation: 14702

Yet another approach.

General. first, notice that doing this sequentially takes O(N^2). With N=1,000,000, this is a LOT. Sorting each list would take O(N*log(N)); then you can find the intersection in one pass by merging the files (see below). So the total is O(2N*log(N) + 2N) = O(N*log(N)).

Sorting a file. Now let's address the fact that working with files is much slower than with memory, especially when sorting where you need to move things around. One way to solve this is - decide the size of the chunk that can be loaded into memory. Load the file one chunk at a time, sort it efficiently and save into a separate temporary file. The sorted chunks can be merged (again, see below) into one sorted file in one pass.

Merging. When you have 2 sorted lists (files or not), you can merge them into one sorted list easily in one pass: have 2 "pointers", initially pointing to the first entry in each list. In each step, compare the values the pointers point to. Move the smaller value to the merged list (the one you are constructing) and advance its pointer.

You can modify the merge algorithm easily to make it find the intersection - if pointed values are equal move it to the results (consider how do you want to deal with duplicates).

For merging more than 2 lists (as in sorting the file above) you can generalize the algorithm for using k pointers.

Upvotes: 3

theheadofabroom

Reputation: 22109

If you're looking for memory efficiency with this sort of thing you'll be hard pushed to get time efficiency. My example will be written in python, but should be relatively easy to implement in any language.

with open(file1) as file_1:
  current_word_1 = read_to_delim(file_1, delim)
  while current_word_1:
    with open(file2) as file_2:
      current_word_2 = read_to_delim(file_2, delim)
      while current_word_2:
        if current_word_2 == current_word_1:
          print current_word_2
        current_word_2 = read_to_delim(file_2, delim)
    current_word_1 = read_to_delim(file_1, delim)

I leave read_to_delim to you, but this is the extreme case that is memory-optimal but time-least-optimal.

depending on your application of course you could load the two files in a database, perform a left outer join, and discard the rows for which one of the two columns is null

Upvotes: 0

Vojislav Stojkovic

Reputation: 8153

I would give prefix trees (aka tries) a shot.

My initial approach would be to determine a maximum depth for the trie that would fit nicely within my RAM limits. Pick an arbitrary depth (say 3, you can tweak it later) and construct a trie up to that depth, for the smaller file. Each leaf would be a list of "file pointers" to words that start with the prefix encoded by the path you followed to reach the leaf. These "file pointers" would keep an offset into the file and the word length.

Then process the second file by reading each word from it and trying to find it in the first file using the trie you constructed. It would allow you to fail faster on words that don't match. The deeper your trie, the faster you can fail, but the more memory you would consume.

Of course, like Stephen Chung said, you still need RAM to store enough information to describe at least one of the files, if you really need an efficient algorithm. If you don't have enough memory -- and you probably don't, because I estimate my approach would require approximately the same amount of memory you would need to load a file whose words were 14-22 characters long -- then you have to process even the first file by parts. In that case, I would actually recommend using the trie for the larger file, not the smaller. Just partition it in parts that are no bigger than the smaller file (or no bigger than your RAM constraints allow, really) and do the whole process I described for each part.

Despite the length, this is sort of off the top of my head. I might be horribly wrong in some details, but this is how I would initially approach the problem and then see where it would take me.

Upvotes: 1

Stephen Chung

Reputation: 14605

Your problem is: Given two sets of items, find the intersaction (items common to both), while staying within the constraints of inadequate RAM (less than the size of any set).

Since finding an intersaction requires comparing/searching each item in another set, you must have enough RAM to store at least one of the sets (the smaller one) to have an efficient algorithm.

Assume that you know for a fact that the intersaction is much smaller than both sets and fits completely inside available memory -- otherwise you'll have to do further work in flushing the results to disk.

If you are working under memory constraints, partition the larger set into parts that fit inside 1/3 of the available memory. Then partition the smaller set into parts the fit the second 1/3. The remaining 1/3 memory is used to store the results.

Optimize by finding the max and min of the partition for the larger set. This is the set that you are comparing from. Then when loading the corresponding partition of the smaller set, skip all items outside the min-max range.

First find the intersaction of both partitions through a double-loop, storing common items to the results set and removing them from the original sets to save on comparisons further down the loop.

Then replace the partition in the smaller set with the second partition (skipping items outside the min-max). Repeat. Notice that the partition in the larger set is reduced -- with common items already removed.

After running through the entire smaller set, repeat with the next partition of the larger set.

Now, if you do not need to preserve the two original sets (e.g. you can overwrite both files), then you can further optimize by removing common items from disk as well. This way, those items no longer need to be compared in further partitions. You then partition the sets by skipping over removed ones.

Upvotes: 1

Doc Brown

Reputation: 20054

If you had enough memory to read the first file completely into RAM, I would suggest reading it into a dictionary (word -> index of that word ), loop over the words of the second file and test if the word is contained in that dictionary. Memory for a million words is not much today.

If you have not enough memory, split the first file into chunks that fit into memory and do as I said above for each of that chunk. For example, fill the dictionary with the first 100.000 words, find every common word for that, then read the file a second time extracting word 100.001 up to 200.000, find the common words for that part, and so on.

And now the hard part: you need a dictionary structure, and you said "basic C". When you are willing to use "basic C++", there is the hash_map data structure provided as an extension to the standard library by common compiler vendors. In basic C, you should also try to use a ready-made library for that, read this SO post to find a link to a free library which seems to support that.

Upvotes: 2

Find common words from two files

Answers (5)

Related Questions