Reputation: 5460
I have a big text file, biggerFile, with 2M entries and another text file, smaller, with 1M entries.
All the entries in the smaller file are also present in the bigger file.
The entries in the bigger file look like this:
helloworld_12345_987654312.zip
helloWorld_12344_987654313.zip
helloWOrld_12346_987654314.zip
The smaller file contains data like this:
987654312
987654313
i.e. the last part of the file name before the .zip extension. I want to find the entries in the bigger file that have no match in the smaller file. Could someone give me some pointers on how to achieve this?
My attempt was to loop over the smaller file, grep for each entry in the larger file, and delete every entry that is found, so that at the end of the process only the missing entries are left in the file.
This solution works, but it is inefficient and crude. Can someone suggest a better approach to this problem?
Upvotes: 2
Views: 137
Reputation: 27073
Grep has a switch -f which reads the patterns from a file. Combine that with -v, which prints only the lines that do not match, and you have an elegant solution. Since your patterns are fixed strings, you can increase performance dramatically by using -F.
grep -F -v -f smallfile bigfile
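One thing to keep in mind is that -F matches a pattern anywhere in the line, not only in the ID field. If that is a concern, the same filtering can be sketched in Python with a set lookup on the extracted ID. This is only a rough sketch: the names extract_id and missing_entries are made up for illustration, and it assumes every entry ends with an underscore, the ID, and a single extension, as in the sample data from the question.

```python
def extract_id(entry):
    """Return the text between the last underscore and the final extension."""
    stem = entry.rsplit('.', 1)[0]   # drop the extension, e.g. ".zip"
    return stem.rsplit('_', 1)[-1]   # keep what follows the last "_"


def missing_entries(big_lines, small_ids):
    """Return the entries of big_lines whose ID is not listed in small_ids."""
    # A set gives constant-time membership tests on average.
    known = set(s.strip() for s in small_ids if s.strip())
    result = []
    for line in big_lines:
        entry = line.strip()
        if entry and extract_id(entry) not in known:
            result.append(entry)
    return result


# Sample data taken from the question.
big = [
    'helloworld_12345_987654312.zip',
    'helloWorld_12344_987654313.zip',
    'helloWOrld_12346_987654314.zip',
]
small = ['987654312', '987654313']

print(missing_entries(big, small))
```

For real files, both arguments can be open file objects, since iterating over a file yields its lines.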
I wrote a python script to generate some test data:
bigfile = open('bigfile', 'w')
smallfile = open('smallfile', 'w')
count = 2000000
start = 1000000
for i in range(start, start + count):
    bigfile.write('foo' + str(i) + 'bar\n')
    if i % 2:
        smallfile.write(str(i) + '\n')
bigfile.close()
smallfile.close()
Here are some tests I ran using only 2000 lines (set count to 2000), because for more lines the time required to run grep without -F was getting ridiculous.
$ time grep -v -f smallfile bigfile > /dev/null
real 0m3.075s
user 0m2.996s
sys 0m0.028s
$ time grep -F -v -f smallfile bigfile > /dev/null
real 0m0.011s
user 0m0.000s
sys 0m0.012s
Grep also has a --mmap switch which might increase performance according to the man page. In my test there was no performance increase.
For these tests I used 2 million lines.
$ time grep -F -v -f smallfile bigfile > /dev/null
real 0m3.900s
user 0m3.736s
sys 0m0.104s
$ time grep -F --mmap -v -f smallfile bigfile > /dev/null
real 0m3.911s
user 0m3.728s
sys 0m0.128s
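Since the timings above discard the output, it may be worth checking separately that the result is what you expect. The following is a small in-memory sketch of that check, not part of the original tests: make_test_data and invert_fixed_match are hypothetical helper names, and the substring test only approximates what grep -F -v -f does. It is quadratic, so it is meant for the 2000-line case only.

```python
def make_test_data(count, start=1000000):
    """Build the same data as the generator script, but as in-memory lists."""
    big = ['foo' + str(i) + 'bar' for i in range(start, start + count)]
    small = [str(i) for i in range(start, start + count) if i % 2]
    return big, small


def invert_fixed_match(big, small):
    """Keep the lines of big that contain none of the fixed strings in small."""
    # Plain substring search, mimicking fixed-string matching with inversion.
    return [line for line in big if not any(p in line for p in small)]


big, small = make_test_data(2000)
remaining = invert_fixed_match(big, small)

# Only the odd numbers are written to the small file,
# so the lines with even numbers should remain.
print(len(remaining))
```

With count set to 2000 this prints 1000, which is the number of lines grep -F -v -f smallfile bigfile should output for the same data.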
Upvotes: 2
Reputation: 123658
Use grep. You can specify the smaller file as the one to obtain the patterns from (using -f filename) and add -v to obtain the lines not matching any pattern.
Since your patterns appear to be fixed strings, you can also supply the -F option, which would speed up grep.
The following should be self-explanatory:
$ cat big
helloworld_12345_987654312.zip
helloWorld_12344_987654313.zip
helloWOrld_12346_987654314.zip
$ cat small
987654312
987654313
$ grep -F -f small big # Find lines matching those in the smaller file
helloworld_12345_987654312.zip
helloWorld_12344_987654313.zip
$ grep -F -v -f small big # Eliminate lines matching those in the smaller file
helloWOrld_12346_987654314.zip
Upvotes: 1