Reputation: 11
I have a file with names, and I want to search for them in a huge tab delimited file and get the corresponding rows. I need to match the exact names. So I tried this grep option, where I search a big file using a the names file (-f) , and I specify that I want exact matches (-w).
grep -f names -w bigfile > outputfile
Because tab delimited file is quite big, the grep command takes too long, so I am looking for an awk alternative. So far I tried:
awk 'NR==FNR{a[$0]=$0}NR>FNR{if($2==a[$2])print $0}' names bigfile > outputfile
Here, I look for matches in column 2 of the big file and print the corresponding rows.
The problem is that this awk option is not specific. Does anyone have a suggestion on how to modify this one-liner so that it matches exact regular expressions?
EDIT
Here's an example:
let's say that the names file looks like this:
GN2a__NODE_1000349_length_1013_cov_0.309117_1
GN4a__NODE_1000349_length_1013_cov_0.303417_1
GN3a__NODE_1000343_length_1013_cov_0.309417_2
GN2a__NODE_1020349_length_1013_cov_0.109437_1
GN2a__NODE_1020349_length_1013_cov_0.109437_10
GO4a__NODE_9_length_201043_cov_2.340371_8
GO4a__NODE_9_length_201043_cov_2.340371_83
and the first 9 rows of a tab delimited bigfile that has > 8 million rows looks like this:
7001253F:563:CC1HJANXX:8:2210:1232:2187#CTGAAGCTCAGGACGT GN2a__NODE_1000349_length_1013_cov_0.309117_1
7001253F:563:CC1HJANXX:8:2210:1968:2031#CTGAAGCTCAGGACGT GN4a__NODE_1000349_length_1013_cov_0.303417_1
7001253F:563:CC1HJANXX:8:2210:1968:2031#CTGAAGCTCAGGACGT GN3a__NODE_1000343_length_1013_cov_0.309417_2
7001253F:563:CC1HJANXX:8:2210:1968:2031#CTGAAGCTCAGGACGT GN2a__NODE_1020349_length_1013_cov_0.109437_1
7001253F:563:CC1HJANXX:8:2210:1968:2031#CTGAAGCTCAGGACGT GN2a__NODE_1020349_length_1013_cov_0.109437_10
7001253F:563:CC1HJANXX:8:2210:1968:2031#CTGAAGCTCAGGACGT GN2a__NODE_1020349_length_1013_cov_0.109437_14
7001253F:563:CC1HJANXX:8:2210:1968:2031#CTGAAGCTCAGGACGT GO4a__NODE_9_length_201043_cov_2.340371_8
7001253F:563:CC1HJANXX:8:2210:1968:2031#CTGAAGCTCAGGACGT GO4a__NODE_9_length_201043_cov_2.340371_83
7001253F:563:CC1HJANXX:8:2210:1968:2031#CTGAAGCTCAGGACGT GO4a__NODE_9_length_201043_cov_2.340371_87
i would like my output to look like this:
7001253F:563:CC1HJANXX:8:2210:1232:2187#CTGAAGCTCAGGACGT GN2a__NODE_1000349_length_1013_cov_0.309117_1
7001253F:563:CC1HJANXX:8:2210:1968:2031#CTGAAGCTCAGGACGT GN4a__NODE_1000349_length_1013_cov_0.303417_1
7001253F:563:CC1HJANXX:8:2210:1968:2031#CTGAAGCTCAGGACGT GN3a__NODE_1000343_length_1013_cov_0.309417_2
7001253F:563:CC1HJANXX:8:2210:1968:2031#CTGAAGCTCAGGACGT GN2a__NODE_1020349_length_1013_cov_0.109437_1
7001253F:563:CC1HJANXX:8:2210:1968:2031#CTGAAGCTCAGGACGT GN2a__NODE_1020349_length_1013_cov_0.109437_10
7001253F:563:CC1HJANXX:8:2210:1968:2031#CTGAAGCTCAGGACGT GO4a__NODE_9_length_201043_cov_2.340371_8
7001253F:563:CC1HJANXX:8:2210:1968:2031#CTGAAGCTCAGGACGT GO4a__NODE_9_length_201043_cov_2.340371_83
And instead with the awk one-liner, I get:
7001253F:563:CC1HJANXX:8:2210:1232:2187#CTGAAGCTCAGGACGT GN2a__NODE_1000349_length_1013_cov_0.309117_1
7001253F:563:CC1HJANXX:8:2210:1968:2031#CTGAAGCTCAGGACGT GN4a__NODE_1000349_length_1013_cov_0.303417_1
7001253F:563:CC1HJANXX:8:2210:1968:2031#CTGAAGCTCAGGACGT GN3a__NODE_1000343_length_1013_cov_0.309417_2
7001253F:563:CC1HJANXX:8:2210:1968:2031#CTGAAGCTCAGGACGT GN2a__NODE_1020349_length_1013_cov_0.109437_1
7001253F:563:CC1HJANXX:8:2210:1968:2031#CTGAAGCTCAGGACGT GN2a__NODE_1020349_length_1013_cov_0.109437_10
7001253F:563:CC1HJANXX:8:2210:1968:2031#CTGAAGCTCAGGACGT GO4a__NODE_9_length_201043_cov_2.340371_8
7001253F:563:CC1HJANXX:8:2210:1968:2031#CTGAAGCTCAGGACGT GO4a__NODE_9_length_201043_cov_2.340371_83
7001253F:563:CC1HJANXX:8:2210:1968:2031#CTGAAGCTCAGGACGT GO4a__NODE_9_length_201043_cov_2.340371_87
Upvotes: 1
Views: 654
Reputation: 203995
This is the job that join
exists to do:
$ join -t$'\t' -1 1 -2 2 -o '2.1 2.2' <(sort names) <(sort -k2,2 bigfile)
7001253F:563:CC1HJANXX:8:2210:1232:2187#CTGAAGCTCAGGACGT GN2a__NODE_1000349_length_1013_cov_0.309117_1
7001253F:563:CC1HJANXX:8:2210:1968:2031#CTGAAGCTCAGGACGT GN2a__NODE_1020349_length_1013_cov_0.109437_1
7001253F:563:CC1HJANXX:8:2210:1968:2031#CTGAAGCTCAGGACGT GN2a__NODE_1020349_length_1013_cov_0.109437_10
7001253F:563:CC1HJANXX:8:2210:1968:2031#CTGAAGCTCAGGACGT GN3a__NODE_1000343_length_1013_cov_0.309417_2
7001253F:563:CC1HJANXX:8:2210:1968:2031#CTGAAGCTCAGGACGT GN4a__NODE_1000349_length_1013_cov_0.303417_1
7001253F:563:CC1HJANXX:8:2210:1968:2031#CTGAAGCTCAGGACGT GO4a__NODE_9_length_201043_cov_2.340371_8
7001253F:563:CC1HJANXX:8:2210:1968:2031#CTGAAGCTCAGGACGT GO4a__NODE_9_length_201043_cov_2.340371_83
Upvotes: 1
Reputation: 67507
your explanation is not very clear. It seems like you're trying for an exact literal match, not regex.
$ awk -F'\t' 'NR==FNR{a[$1]; next} $2 in a' names bigfile > outputfile
should work even if you have spaces in the first field. However, I expect a literal grep match to be faster than this.
Note that this is almost the same as the other awk
script.
Upvotes: 2
Reputation: 510
I don't see anything wrong with your one-liner though it could be simplified to this for a slight performance gain
awk 'NR==FNR{a[$0]} NR>FNR{if($2 in a)print $0}' names bigfile >outputfile
or bit more compact
awk 'NR==FNR{a[$0]} NR>FNR && $2 in a' names bigfile >outputfile
My guess is there are invisible control characters in your names file that are throwing off the output
I'm also surprised you found grep slower than awk. My tests with your data duplicated to 3 million lines show grep is 3x faster than awk
Are you running this on Linux? Is this GNU grep and awk?
Upvotes: 2
Reputation: 133610
EDIT: Since OP's samples were completely changed so adding this solution now, again based on shown samples only.
awk '
FNR==NR{
a[$0]
next
}
match($0,/[[:alnum:]]+__NODE_[0-9]+_length.*cov_[0-9]+\.[0-9]+_[0-9]+/) && (substr($0,RSTART,RLENGTH) in a)
' names Input_file
Could you please try following, written and tested on shown samples only in GNU awk
.
awk '
FNR==NR{
a[$0]
next
}
match($0,/Gene[0-9]+/) && (substr($0,RSTART,RLENGTH) in a)
' names Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition FNR==NR which will be TRUE when first Input_file is being read.
a[$0] ##Creating array a with index of current line here.
next ##next will skip all further statement from here.
}
match($0,/Gene[0-9]+/) && (substr($0,RSTART,RLENGTH) in a) ##Matching regex of Gene digits(which are coing together) AND making sure this substring(which is actually matched value by regex) is present in names file array then print current line.
' names Input_file ##Mentioning Input_file names here.
Upvotes: 2