Generating a diff between two asymmetric files in Bash

Question

I have a big text file, biggerFile, with 2M entries and another text file, smaller, with 1M entries.

All the entries in the smaller file are also present in the bigger file.

The format of the entries in the bigger file is:

helloworld_12345_987654312.zip
helloWorld_12344_987654313.zip
helloWOrld_12346_987654314.zip

The smaller file contains data like:

987654312
987654313

i.e. the last part of the file name before the .zip extension. Could someone give me any pointers on how I could generate the diff, that is, list the entries of the bigger file that are missing from the smaller file?

My attempt was to loop over the smaller file, grep for each entry in the larger file, and delete the entry from the larger file whenever it is found, so that at the end of the process only the missing entries are left in the file.

This solution works, but it is inefficient and crude. Can someone suggest a better approach to this problem?

Lesmana · Accepted Answer

Grep has a switch -f, which reads the patterns from a file. Combine that with -v, which prints only the lines that do not match, and you have an elegant solution. Since your patterns are fixed strings, you can increase performance dramatically by using -F.

grep -F -v -f smallfile bigfile
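
To make the semantics of that command concrete, here is a minimal Python sketch of what it computes: it keeps every line of the big file that contains none of the fixed strings from the small file. The function name lines_without_patterns is a hypothetical helper, the sample data is taken from the question, and the naive nested loop only illustrates the logic. It says nothing about how grep is implemented internally.

```python
def lines_without_patterns(big_lines, patterns):
    """Return the lines that contain none of the patterns as a substring.

    Illustrates the effect of `grep -F -v -f smallfile bigfile`.
    The name is hypothetical and the loop is deliberately naive.
    """
    return [line for line in big_lines
            if not any(p in line for p in patterns)]


# Sample data from the question (assumed to be already stripped of newlines).
big_lines = [
    "helloworld_12345_987654312.zip",
    "helloWorld_12344_987654313.zip",
    "helloWOrld_12346_987654314.zip",
]
patterns = ["987654312", "987654313"]

missing = lines_without_patterns(big_lines, patterns)
print(missing)  # ['helloWOrld_12346_987654314.zip']
```

Only the third entry survives, because it is the only line that contains neither ID.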

I wrote a Python script to generate some test data:

bigfile = open('bigfile', 'w')
smallfile = open('smallfile', 'w')

count = 2000000
start = 1000000

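for i in range(start, start + count):
  # every number goes into bigfile, wrapped in a fixed prefix and suffix
  bigfile.write('foo' + str(i) + 'bar\n')
  # only the odd numbers go into smallfile
  if i % 2:
    smallfile.write(str(i) + '\n')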

bigfile.close()
smallfile.close()

Here are some tests I ran using only 2000 lines (set count to 2000), because with more lines the time required to run grep without -F became impractically long.

$ time grep -v -f smallfile bigfile > /dev/null

real    0m3.075s
user    0m2.996s
sys     0m0.028s

$ time grep -F -v -f smallfile bigfile > /dev/null

real    0m0.011s
user    0m0.000s
sys     0m0.012s

Grep also has a --mmap switch, which, according to the man page, might increase performance. In my test there was no performance increase.

For these tests I used 2 million lines.

$ time grep -F -v -f smallfile bigfile > /dev/null

real    0m3.900s
user    0m3.736s
sys     0m0.104s

$ time grep -F --mmap -v -f smallfile bigfile > /dev/null

real    0m3.911s
user    0m3.728s
sys     0m0.128s
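
One caveat with fixed-string matching is that a pattern matches anywhere in the line, so an ID that also occurs inside a longer ID or in another field of the name would hide that line as well. If that matters for your data, an exact lookup on the extracted ID is an alternative technique (this is not what grep does). A rough sketch, assuming every entry has the form prefix_number_ID.zip as in the question. extract_id and missing_entries are hypothetical helper names:

```python
def extract_id(name):
    """Return the text between the last underscore and the final extension."""
    stem = name.rsplit(".", 1)[0]    # drop the ".zip" extension
    return stem.rsplit("_", 1)[-1]   # keep the last underscore-separated field


def missing_entries(big_lines, small_ids):
    """Return the entries whose ID does not appear in small_ids (exact match)."""
    known = set(small_ids)           # set membership tests are O(1) on average
    return [line for line in big_lines if extract_id(line) not in known]


big_lines = [
    "helloworld_12345_987654312.zip",
    "helloWorld_12344_987654313.zip",
    "helloWOrld_12346_987654314.zip",
]

print(missing_entries(big_lines, ["987654312", "987654313"]))
# ['helloWOrld_12346_987654314.zip']

# "12345" occurs in the middle field of the first entry, but it is not that
# entry's ID, so an exact lookup keeps all three lines.
print(len(missing_entries(big_lines, ["12345"])))  # 3
```

When reading from real files, each line would need its trailing newline stripped before being passed to these helpers.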
