Reputation: 1770
So I have a huuuuge file and a big list of items that I want to grep out of that file. For the sake of this example, let the files be set up like this:
seq 1 10000 > file.txt #file.txt contains numbers from 1 to 10000
seq 1 5 10000 > list #list contains every fifth number from 1 to 10000
My question is: what is the best way to grep out the lines corresponding to 'list' from 'file.txt'?
I tried it in two ways:
time while read i ; do grep -w "$i" file.txt ; done < list > output
That command took real 0m1.300s.
time grep -wf list file.txt > output
This one was slower, clocking in at real 0m1.402s.
Is there a better (faster) way to do this? Is there a best way that I'm missing?
Upvotes: 0
Views: 196
Reputation: 827
You're comparing apples and oranges
This command greps for the words from list in file.txt:
time for i in `cat list`; do grep -w "$i" file.txt ; done > output
This command greps for the patterns from file.txt in list:
time grep -f file.txt list > output
You need to fix one file as the source of the strings to match and the other file as the target data in which to match them. Use the same grep options, such as -w or -F, in both cases.
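For example, a like-for-like pair, with list as the pattern source, file.txt as the target, and the same -Fw options in both (these are the commands timed in the script below), would be:
grep -Fw -f list file.txt > output
for i in `cat list`; do grep -Fw "$i" file.txt ; done > output
Note that the loop form re-reads the whole of file.txt once per pattern, which is why it scales so badly as list grows.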
It sounds like list is the source of patterns and file.txt is the target data file. Here are my timings for the adjusted original commands, plus one awk and two sed solutions; the sed solutions differ in whether the patterns are given as separate sed commands or as one extended regex.
timings
one grep
real 0m0.016s
user 0m0.001s
sys 0m0.001s
2000 output1
loop grep
real 0m10.120s
user 0m0.060s
sys 0m0.212s
2000 output2
awk
real 0m0.022s
user 0m0.007s
sys 0m0.000s
2000 output3
sed
real 0m4.260s
user 0m4.211s
sys 0m0.022s
2000 output4
sed -r
real 0m0.144s
user 0m0.085s
sys 0m0.047s
2000 output5
script
n=10000
seq 1 $n >file.txt
seq 1 5 $n >list
echo "one grep"
time grep -Fw -f list file.txt > output1
wc -l output1
echo "loop grep"
time for i in `cat list`; do grep -Fw "$i" file.txt ; done > output2
wc -l output2
echo "awk"
time awk 'ARGIND==1 {list[$1]; next} $1 in list' list file.txt >output3  # ARGIND==1 selects the first file (GNU awk)
wc -l output3
echo "sed"
sed 's/^/\/^/;s/$/$\/p/' list >list.sed
time sed -n -f list.sed file.txt >output4
wc -l output4
echo "sed -r"
tr '\n' '|' <list|sed 's/^/\/^(/;s/|$/)$\/p/' >list.sedr
time sed -nr -f list.sedr file.txt >output5
wc -l output5
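For reference, the generated sed scripts look roughly like this (abbreviated):
# list.sed - 2000 separate address commands, one per pattern
/^1$/p
/^6$/p
/^11$/p
...
# list.sedr - a single extended-regex alternation
/^(1|6|11|...|9996)$/p
Every command in list.sed is tried against every one of the 10000 input lines, which is presumably where the ~4 s goes; the single alternation in list.sedr is matched once per line, hence the much better timing.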
Upvotes: 2
Reputation: 14979
You can try awk:
awk 'NR==FNR{a[$1];next} $1 in a' file.txt list
On my system, awk is faster than grep with the sample data.
Test:
$ time grep -f file.txt list > out
real 0m1.231s
user 0m1.056s
sys 0m0.175s
$ time awk 'NR==FNR{a[$1];next} $1 in a' file.txt list > out1
real 0m0.068s
user 0m0.067s
sys 0m0.001s
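If it helps, here is the same awk one-liner with comments showing how the NR==FNR idiom works (plain POSIX awk, nothing GNU-specific is assumed):
awk 'NR==FNR { a[$1]; next }   # while reading the first file, remember each key in a hash
     $1 in a                   # while reading the second file, print lines whose key was seen
' file.txt list
Since the lookup is a hash membership test, each file is read only once, which is why this stays fast as the inputs grow.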
Upvotes: 1
Reputation: 21965
Faster or not, you have a useless use of cat up there. Why not just:
grep -f list file.txt # aren't the files meant to be the other way around?
Or use a slightly more customized awk:
awk 'NR==FNR{a[$1];next} $1 in a{print $1;next}' list file.txt
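Following the earlier point about keeping the grep options consistent, if the entries in list should be matched as fixed strings and whole words, the grep form would presumably be:
grep -Fwf list file.txt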
Upvotes: 0