Reputation: 785
I have two files of 20 GB each, and I have to find the strings common to both. Assume the maximum string length is 20 bytes. To solve this I am using the following algorithm, on a system with 8 GB RAM and a quad-core i3 CPU:
sort both files using any suitable sorting utility
open files A and B for reading
read wordA from A
read wordB from B
while (A not EOF and B not EOF)
{
    if (wordA < wordB)
        read wordA from A            /* A is behind, advance A */
    else if (wordA > wordB)
        read wordB from B            /* B is behind, advance B */
    else
    {
        /* match found, store it in the output file */
        write wordA into output
        read wordA from A
    }
}
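For reference, a minimal C sketch of that merge step might look like the following, assuming one string per line in the already-sorted files (the file names and buffer size here are placeholders, not my actual code):

#include <stdio.h>
#include <string.h>

#define MAXLEN 64                /* strings are at most 20 bytes, so this is ample */

int main(void)
{
    FILE *fa  = fopen("A.sorted", "r");     /* placeholder file names */
    FILE *fb  = fopen("B.sorted", "r");
    FILE *out = fopen("common.txt", "w");
    if (!fa || !fb || !out)
        return 1;

    char wordA[MAXLEN], wordB[MAXLEN];
    int haveA = fgets(wordA, sizeof wordA, fa) != NULL;
    int haveB = fgets(wordB, sizeof wordB, fb) != NULL;

    while (haveA && haveB) {
        int cmp = strcmp(wordA, wordB);
        if (cmp < 0)                         /* A is behind, advance A */
            haveA = fgets(wordA, sizeof wordA, fa) != NULL;
        else if (cmp > 0)                    /* B is behind, advance B */
            haveB = fgets(wordB, sizeof wordB, fb) != NULL;
        else {                               /* match found */
            fputs(wordA, out);
            haveA = fgets(wordA, sizeof wordA, fa) != NULL;
        }
    }

    fclose(fa);
    fclose(fb);
    fclose(out);
    return 0;
}

The merge step itself only streams both files and needs constant memory; the heavy part should be the external sort.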
This ran successfully on the configuration above, BUT when I run the same algorithm on a system with 6 GB RAM, 120 GB of free disk space, and a 6-core i3 processor, the system shuts down repeatedly. Why is this happening?
Please suggest any other way to solve this problem. Can we improve its performance?
Upvotes: 1
Views: 504
Reputation: 140237
Yes, you can dramatically improve the performance using a very short awk one-liner:
awk 'FNR==NR{a[$0]++;next}a[$0]' file1 file2
With awk you can find the common lines without having to sort the files first. You didn't really say what you wanted to do with the common lines, so I just assumed you wanted to print them out.
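For anyone unfamiliar with the idiom: FNR==NR is true only while awk is reading the first file (FNR is the record number within the current file, NR the record number overall), so the first action loads every line of file1 into the array a. For file2, the bare pattern a[$0] is nonzero exactly for lines that occurred in file1, and awk's default action is to print the matching line.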
If you only want to print a common line once, no matter how many times it repeats, you can use this:
awk 'FNR==NR{a[$0]=1;next}a[$0]-- > 0' file1 file2
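Here the test a[$0]-- > 0 succeeds only for the first occurrence of a common line in file2: the stored value is checked before the decrement, so it drops to 0 after the first match and that line is never printed again.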
Upvotes: 3