Chem-man17

Reputation: 1770

Different ways of grepping for large amounts of data

So I have a huuuuge file and a big list of items that I want to grep out from that file. For the sake of this example, let the files be generated as follows:

seq 1 10000 > file.txt          #file.txt contains numbers from 1 to 10000
seq 1 5 10000 > list            #list contains every fifth number from 1 to 10000

My question is: which is the best way to grep out the lines corresponding to 'list' from 'file.txt'?

I tried it in two ways:

time while read i ; do grep -w "$i" file.txt ; done < list > output

That command took real 0m1.300s.

time grep -wf list file.txt > output

This one was slower, clocking in at real 0m1.402s.

Is there a better (faster) way to do this? Is there a best way that I'm missing?

Upvotes: 0

Views: 196

Answers (3)

pakistanprogrammerclub

Reputation: 827

You're comparing apples and oranges.

This command greps for the words from list in file.txt:

time for i in `cat list`; do grep -w "$i" file.txt ; done > output

This command greps for the patterns from file.txt in list:

time grep -f file.txt list > output

You need to fix one file as the source of strings to match and the other file as the target data in which to match those strings. Also use the same grep options, such as -w or -F, in both commands.
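A tiny demonstration of why the orientation matters (the file names data.txt and pats.txt are made up for this illustration; -F treats each pattern as a fixed string rather than a regex):

```shell
# Hypothetical two-line inputs to show that swapping the -f file
# and the data file changes the result.
printf '1\n2\n3\n'  > data.txt
printf '13\n2\n'    > pats.txt

# Patterns taken from pats.txt, searched for in data.txt:
grep -Ff pats.txt data.txt      # prints: 2

# Patterns taken from data.txt, searched for in pats.txt:
grep -Ff data.txt pats.txt      # prints: 13 and 2
```

The second command matches "13" as well, because the pattern "1" from data.txt matches it as a substring; that is the kind of asymmetry the two original timings were silently comparing.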

It sounds like list is the source of patterns and file.txt is the target data file. Here are my timings for the adjusted original commands, plus one awk solution and two sed solutions; the sed solutions differ in whether the patterns are given as separate sed commands or as one extended regex.

Timings:

one grep
real    0m0.016s
user    0m0.001s
sys     0m0.001s
2000 output1

loop grep
real    0m10.120s
user    0m0.060s
sys     0m0.212s
2000 output2

awk
real    0m0.022s
user    0m0.007s
sys     0m0.000s
2000 output3

sed
real    0m4.260s
user    0m4.211s
sys     0m0.022s
2000 output4

sed -r
real    0m0.144s
user    0m0.085s
sys     0m0.047s
2000 output5

Script:

n=10000
seq 1 $n >file.txt             
seq 1 5 $n >list               

echo "one grep"
time grep -Fw -f list file.txt > output1
wc -l output1

echo "loop grep"
time for i in `cat list`; do grep -Fw "$i" file.txt ; done > output2
wc -l output2

echo "awk"
time awk 'ARGIND==1 {list[$1]; next} $1 in list' list file.txt >output3
wc -l output3

echo "sed"
sed 's/^/\/^/;s/$/$\/p/' list >list.sed
time sed -n -f list.sed file.txt >output4
wc -l output4

echo "sed -r"
tr '\n' '|' <list|sed 's/^/\/^(/;s/|$/)$\/p/' >list.sedr
time sed -nr -f list.sedr file.txt >output5
wc -l output5

Upvotes: 2

sat

Reputation: 14979

You can try awk:

awk 'NR==FNR{a[$1];next} $1 in a' file.txt list

On my system, awk is faster than grep with the sample data.

Test:

$ time grep -f file.txt list > out

real    0m1.231s
user    0m1.056s
sys     0m0.175s

$ time awk 'NR==FNR{a[$1];next} $1 in a' file.txt list > out1

real    0m0.068s
user    0m0.067s
sys     0m0.001s
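Much of the grep slowdown here comes from treating every line of the pattern file as a regex. If you stay with grep, adding -F (fixed-string matching) and -w (whole-word matching) is usually far closer to the awk timing; a quick sketch with the question's data:

```shell
# Regenerate the question's sample data.
seq 1 10000 > file.txt
seq 1 5 10000 > list

# -F: patterns are fixed strings (no regex engine),
# -w: match whole words only, -f: read patterns from 'list'.
time grep -Fwf list file.txt > out
wc -l out    # 2000 lines, one per entry in 'list'
```

The exact numbers will vary by system, so treat this as a direction to test rather than a guaranteed result.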

Upvotes: 1

sjsam

Reputation: 21965

Faster or not, you have a useless use of cat up there. Why not:

grep -f list file.txt   # aren't the files meant to go the other way around?

Or use a slightly more customized awk:

awk 'NR==FNR{a[$1];next} $1 in a{print $1;next}' list file.txt
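Since both files here hold one item per line, another option worth timing is comm on sorted copies (a sketch; comm requires lexicographically sorted input, hence the sort calls, and <(...) process substitution assumes bash):

```shell
# Regenerate the question's sample data.
seq 1 10000 > file.txt
seq 1 5 10000 > list

# -12 suppresses lines unique to each file, leaving only the
# lines common to both (sorted lexicographically, not numerically).
comm -12 <(sort file.txt) <(sort list) > output
wc -l output    # 2000 lines
```

Note the output order follows the lexicographic sort, so pipe through sort -n afterwards if numeric order matters.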

Upvotes: 0
