Gopal

Reputation: 785

Searching for common strings in two given input files

I have two files of 20 GB each and need to find the strings common to both. Assume the maximum string length is 20 bytes. To solve this I am using the following algorithm, on a system with 8 GB of RAM and a quad-core i3 CPU:

sort both files using any suitable sorting utility
open files A and B for reading
read wordA from A
read wordB from B
while (A not EOF and B not EOF)
{
    if (wordA < wordB)
        read wordA from A
    else if (wordA > wordB)
        read wordB from B
    else
    {
        /* match found, write it to the output file */
        write wordA into output
        read wordA from A
        read wordB from B
    }
}
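The sorted-merge loop above is essentially what the standard `comm` utility implements, so one way to sanity-check the approach is to lean on existing tools. A minimal sketch, using hypothetical filenames `fileA` and `fileB`:

```shell
# Sort both inputs under the same locale so the comparison order
# matches, then keep only the lines common to both files
# (-12 suppresses columns 1 and 2 of comm's output).
export LC_ALL=C
sort fileA > fileA.sorted
sort fileB > fileB.sorted
comm -12 fileA.sorted fileB.sorted > common.txt
```

`sort` performs an external merge sort using temporary files, so it handles inputs far larger than RAM.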

This ran successfully on the configuration above, BUT when I ran the same algorithm on a system with 6 GB of RAM, 120 GB of available disk space and a 6-core i3 processor, my system shut down many times. Why is this happening?

Please suggest another way to solve this problem. Can its performance be improved?

Upvotes: 1

Views: 504

Answers (1)

SiegeX

Reputation: 140237

Yes, you can dramatically improve the performance using a very short awk one-liner:

awk 'FNR==NR{a[$0]++;next}a[$0]' file1 file2

With awk you can find common lines without having to sort the files first. You didn't really say what you wanted to do with the common lines, so I just assumed you wanted to print them out.
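For example, on two small hypothetical sample files, the one-liner prints each line of file2 that also occurs in file1 (a repeated line in file2 is printed each time it occurs):

```shell
printf 'apple\nbanana\ncherry\n' > file1
printf 'banana\ndate\nbanana\n' > file2
# While FNR==NR awk is still reading file1, so a[$0]++ records each
# of its lines; afterwards a[$0] is a truthy pattern (print the line)
# exactly for file2 lines that were seen in file1.
awk 'FNR==NR{a[$0]++;next}a[$0]' file1 file2
# prints:
# banana
# banana
```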

If you want to print a common line only once, no matter how many times it repeats, you can use this:

awk 'FNR==NR{a[$0]=1;next}a[$0]-- > 0' file1 file2
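With the same hypothetical sample files, this variant prints the repeated common line only once, because `a[$0]--` drops the counter to zero after the first match:

```shell
printf 'apple\nbanana\ncherry\n' > file1
printf 'banana\ndate\nbanana\n' > file2
# a[$0]=1 marks file1 lines; "a[$0]-- > 0" is true only the first
# time a marked line shows up in file2, then the counter is spent.
awk 'FNR==NR{a[$0]=1;next}a[$0]-- > 0' file1 file2
# prints:
# banana
```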

Upvotes: 3
