dpsdce

Reputation: 5460

Generating a diff between two asymmetric files in Bash

I have a big text file, biggerFile, with 2M entries and another text file, smaller, with 1M entries.

All the entries in the smaller file are also present in the bigger file.

The format of entries in the bigger file is:

helloworld_12345_987654312.zip
helloWorld_12344_987654313.zip
helloWOrld_12346_987654314.zip

The smaller file contains data like:

987654312
987654313

i.e. the last part of the file name before the .zip extension. I want to find the entries in the bigger file that have no match in the smaller file. Could someone give me any pointers on how I could achieve this?

My attempt was to loop over the smaller file, grep for each entry in the larger file, and delete the matching line from the larger file whenever it is found, so that at the end of the process only the missing entries are left in the file.

This solution works, but it is inefficient and crude. Can someone suggest a better approach to this problem?

Upvotes: 2

Views: 137

Answers (2)

Lesmana

Reputation: 27073

Grep has a switch -f which reads the patterns from a file. Combine that with -v, which prints only the lines that do not match, and you have an elegant solution. Since your patterns are fixed strings, you can increase performance dramatically by using -F.

grep -F -v -f smallfile bigfile
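One thing to keep in mind is that -F matches a pattern anywhere in the line, so an ID that happened to equal the middle field of a name would also count as a match. If that matters, the comparison can be restricted to the ID field. The following is a minimal Python sketch of that idea, not a replacement for the grep command. It assumes the name format from the question (the ID is the last underscore-separated field before the extension); `extract_id` and `missing_entries` are hypothetical helper names introduced here for illustration.

```python
def extract_id(name):
    """Return the text between the last '_' and the file extension."""
    stem = name.rsplit(".", 1)[0]    # drop the extension, e.g. ".zip"
    return stem.rsplit("_", 1)[-1]   # keep the last underscore-separated field


def missing_entries(big_lines, small_ids):
    """Yield entries of the big file whose ID is not listed in the small file."""
    # A set gives constant-time lookups, so each big-file line is checked once.
    known = {s.strip() for s in small_ids if s.strip()}
    for line in big_lines:
        name = line.strip()
        if name and extract_id(name) not in known:
            yield name


# Sample data taken from the question.
big = [
    "helloworld_12345_987654312.zip\n",
    "helloWorld_12344_987654313.zip\n",
    "helloWOrld_12346_987654314.zip\n",
]
small = ["987654312\n", "987654313\n"]

for entry in missing_entries(big, small):
    print(entry)
```

Both arguments can be any iterables of lines, so open file objects for the two files can be passed directly in place of the sample lists.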

I wrote a Python script to generate some test data:

# Generate test data: bigfile gets every number, smallfile only the odd ones.
count = 2000000
start = 1000000

with open('bigfile', 'w') as bigfile, open('smallfile', 'w') as smallfile:
    for i in range(start, start + count):
        bigfile.write('foo' + str(i) + 'bar\n')
        if i % 2:  # odd numbers only, so smallfile has half as many lines
            smallfile.write(str(i) + '\n')

Here are some tests I ran using only 2000 lines (set count to 2000) because for more lines the time required to run grep without -F was getting ridiculous.

$ time grep -v -f smallfile bigfile > /dev/null

real    0m3.075s
user    0m2.996s
sys     0m0.028s

$ time grep -F -v -f smallfile bigfile > /dev/null

real    0m0.011s
user    0m0.000s
sys     0m0.012s

Grep also has a --mmap switch which might increase performance according to the man page. In my test there was no performance increase.

For these tests I used 2 million lines.

$ time grep -F -v -f smallfile bigfile > /dev/null

real    0m3.900s
user    0m3.736s
sys     0m0.104s

$ time grep -F --mmap -v -f smallfile bigfile > /dev/null

real    0m3.911s
user    0m3.728s
sys     0m0.128s

Upvotes: 2

devnull

Reputation: 123658

Use grep. You can have it read its patterns from the smaller file (using -f filename) and add -v to print only the lines that do not match any pattern.

Since your patterns appear to be fixed strings, you can also supply the -F option, which speeds up grep.

The following should be self-explanatory:

$ cat big 
helloworld_12345_987654312.zip
helloWorld_12344_987654313.zip
helloWOrld_12346_987654314.zip
$ cat small 
987654312
987654313
$ grep -F -f small big      # Find lines matching those in the smaller file
helloworld_12345_987654312.zip
helloWorld_12344_987654313.zip
$ grep -F -v -f small big   # Eliminate lines matching those in the smaller file
helloWOrld_12346_987654314.zip
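
The result is only the true set of missing entries if every ID in the smaller file really does occur in the bigger one, as the question states. A minimal Python sketch for checking that assumption is shown here. It assumes the name format from the question (the ID is the last underscore-separated field before the extension), and `ids_not_in_big` is a hypothetical helper name introduced for illustration.

```python
def ids_not_in_big(big_lines, small_ids):
    """Return the IDs from the small file that match no entry in the big file."""
    big_ids = set()
    for line in big_lines:
        name = line.strip()
        if name:
            stem = name.rsplit(".", 1)[0]          # drop the extension
            big_ids.add(stem.rsplit("_", 1)[-1])   # last underscore-separated field
    wanted = {s.strip() for s in small_ids if s.strip()}
    return sorted(wanted - big_ids)


# Sample data from the session above, plus one ID that has no matching entry.
big = [
    "helloworld_12345_987654312.zip",
    "helloWorld_12344_987654313.zip",
    "helloWOrld_12346_987654314.zip",
]
small = ["987654312", "987654313", "111111111"]

print(ids_not_in_big(big, small))
```

An empty list means the smaller file is fully covered by the bigger one.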

Upvotes: 1
