joanaseneca
joanaseneca

Reputation: 11

print rows that match exact pattern using awk

I have a file with names, and I want to search for them in a huge tab delimited file and get the corresponding rows. I need to match the exact names. So I tried this grep option, where I search a big file using a the names file (-f) , and I specify that I want exact matches (-w).

grep -f names -w bigfile > outputfile

Because tab delimited file is quite big, the grep command takes too long, so I am looking for an awk alternative. So far I tried:

awk 'NR==FNR{a[$0]=$0}NR>FNR{if($2==a[$2])print $0}' names bigfile > outputfile

Here, I look for matches in column 2 of the big file and print the corresponding rows.

The problem is that this awk option is not specific. Does anyone have a suggestion on how to modify this one-liner so that it matches exact regular expressions?

EDIT

Here's an example:

let's say that the names file looks like this:

GN2a__NODE_1000349_length_1013_cov_0.309117_1
GN4a__NODE_1000349_length_1013_cov_0.303417_1
GN3a__NODE_1000343_length_1013_cov_0.309417_2
GN2a__NODE_1020349_length_1013_cov_0.109437_1
GN2a__NODE_1020349_length_1013_cov_0.109437_10
GO4a__NODE_9_length_201043_cov_2.340371_8
GO4a__NODE_9_length_201043_cov_2.340371_83

and the first 9 rows of a tab delimited bigfile that has > 8 million rows looks like this:

7001253F:563:CC1HJANXX:8:2210:1232:2187#CTGAAGCTCAGGACGT    GN2a__NODE_1000349_length_1013_cov_0.309117_1      
7001253F:563:CC1HJANXX:8:2210:1968:2031#CTGAAGCTCAGGACGT    GN4a__NODE_1000349_length_1013_cov_0.303417_1  
7001253F:563:CC1HJANXX:8:2210:1968:2031#CTGAAGCTCAGGACGT    GN3a__NODE_1000343_length_1013_cov_0.309417_2
7001253F:563:CC1HJANXX:8:2210:1968:2031#CTGAAGCTCAGGACGT    GN2a__NODE_1020349_length_1013_cov_0.109437_1
7001253F:563:CC1HJANXX:8:2210:1968:2031#CTGAAGCTCAGGACGT    GN2a__NODE_1020349_length_1013_cov_0.109437_10
7001253F:563:CC1HJANXX:8:2210:1968:2031#CTGAAGCTCAGGACGT    GN2a__NODE_1020349_length_1013_cov_0.109437_14
7001253F:563:CC1HJANXX:8:2210:1968:2031#CTGAAGCTCAGGACGT    GO4a__NODE_9_length_201043_cov_2.340371_8
7001253F:563:CC1HJANXX:8:2210:1968:2031#CTGAAGCTCAGGACGT    GO4a__NODE_9_length_201043_cov_2.340371_83
7001253F:563:CC1HJANXX:8:2210:1968:2031#CTGAAGCTCAGGACGT    GO4a__NODE_9_length_201043_cov_2.340371_87

i would like my output to look like this:

7001253F:563:CC1HJANXX:8:2210:1232:2187#CTGAAGCTCAGGACGT    GN2a__NODE_1000349_length_1013_cov_0.309117_1     
7001253F:563:CC1HJANXX:8:2210:1968:2031#CTGAAGCTCAGGACGT    GN4a__NODE_1000349_length_1013_cov_0.303417_1  
7001253F:563:CC1HJANXX:8:2210:1968:2031#CTGAAGCTCAGGACGT    GN3a__NODE_1000343_length_1013_cov_0.309417_2
7001253F:563:CC1HJANXX:8:2210:1968:2031#CTGAAGCTCAGGACGT    GN2a__NODE_1020349_length_1013_cov_0.109437_1
7001253F:563:CC1HJANXX:8:2210:1968:2031#CTGAAGCTCAGGACGT    GN2a__NODE_1020349_length_1013_cov_0.109437_10
7001253F:563:CC1HJANXX:8:2210:1968:2031#CTGAAGCTCAGGACGT    GO4a__NODE_9_length_201043_cov_2.340371_8
7001253F:563:CC1HJANXX:8:2210:1968:2031#CTGAAGCTCAGGACGT    GO4a__NODE_9_length_201043_cov_2.340371_83

And instead with the awk one-liner, I get:

7001253F:563:CC1HJANXX:8:2210:1232:2187#CTGAAGCTCAGGACGT    GN2a__NODE_1000349_length_1013_cov_0.309117_1      
7001253F:563:CC1HJANXX:8:2210:1968:2031#CTGAAGCTCAGGACGT    GN4a__NODE_1000349_length_1013_cov_0.303417_1  
7001253F:563:CC1HJANXX:8:2210:1968:2031#CTGAAGCTCAGGACGT    GN3a__NODE_1000343_length_1013_cov_0.309417_2
7001253F:563:CC1HJANXX:8:2210:1968:2031#CTGAAGCTCAGGACGT    GN2a__NODE_1020349_length_1013_cov_0.109437_1
7001253F:563:CC1HJANXX:8:2210:1968:2031#CTGAAGCTCAGGACGT    GN2a__NODE_1020349_length_1013_cov_0.109437_10
7001253F:563:CC1HJANXX:8:2210:1968:2031#CTGAAGCTCAGGACGT    GO4a__NODE_9_length_201043_cov_2.340371_8
7001253F:563:CC1HJANXX:8:2210:1968:2031#CTGAAGCTCAGGACGT    GO4a__NODE_9_length_201043_cov_2.340371_83
7001253F:563:CC1HJANXX:8:2210:1968:2031#CTGAAGCTCAGGACGT    GO4a__NODE_9_length_201043_cov_2.340371_87

Upvotes: 1

Views: 654

Answers (4)

Ed Morton
Ed Morton

Reputation: 203995

This is the job that join exists to do:

$ join -t$'\t' -1 1 -2 2 -o '2.1 2.2' <(sort names) <(sort -k2,2 bigfile)
7001253F:563:CC1HJANXX:8:2210:1232:2187#CTGAAGCTCAGGACGT    GN2a__NODE_1000349_length_1013_cov_0.309117_1
7001253F:563:CC1HJANXX:8:2210:1968:2031#CTGAAGCTCAGGACGT    GN2a__NODE_1020349_length_1013_cov_0.109437_1
7001253F:563:CC1HJANXX:8:2210:1968:2031#CTGAAGCTCAGGACGT    GN2a__NODE_1020349_length_1013_cov_0.109437_10
7001253F:563:CC1HJANXX:8:2210:1968:2031#CTGAAGCTCAGGACGT    GN3a__NODE_1000343_length_1013_cov_0.309417_2
7001253F:563:CC1HJANXX:8:2210:1968:2031#CTGAAGCTCAGGACGT    GN4a__NODE_1000349_length_1013_cov_0.303417_1
7001253F:563:CC1HJANXX:8:2210:1968:2031#CTGAAGCTCAGGACGT    GO4a__NODE_9_length_201043_cov_2.340371_8
7001253F:563:CC1HJANXX:8:2210:1968:2031#CTGAAGCTCAGGACGT    GO4a__NODE_9_length_201043_cov_2.340371_83

Upvotes: 1

karakfa
karakfa

Reputation: 67507

your explanation is not very clear. It seems like you're trying for an exact literal match, not regex.

$ awk -F'\t' 'NR==FNR{a[$1]; next} $2 in a' names bigfile > outputfile

should work even if you have spaces in the first field. However, I expect a literal grep match to be faster than this.

Note that this is almost the same as the other awk script.

Upvotes: 2

Zartaj Majeed
Zartaj Majeed

Reputation: 510

I don't see anything wrong with your one-liner though it could be simplified to this for a slight performance gain

awk 'NR==FNR{a[$0]} NR>FNR{if($2 in a)print $0}' names bigfile >outputfile

or bit more compact

awk 'NR==FNR{a[$0]} NR>FNR && $2 in a' names bigfile >outputfile

My guess is there are invisible control characters in your names file that are throwing off the output

I'm also surprised you found grep slower than awk. My tests with your data duplicated to 3 million lines show grep is 3x faster than awk

Are you running this on Linux? Is this GNU grep and awk?

Upvotes: 2

RavinderSingh13
RavinderSingh13

Reputation: 133610

EDIT: Since OP's samples were completely changed so adding this solution now, again based on shown samples only.

awk '
FNR==NR{
  a[$0]
  next
}
match($0,/[[:alnum:]]+__NODE_[0-9]+_length.*cov_[0-9]+\.[0-9]+_[0-9]+/) && (substr($0,RSTART,RLENGTH) in a)
' names Input_file


Could you please try following, written and tested on shown samples only in GNU awk.

awk '
FNR==NR{
  a[$0]
  next
}
match($0,/Gene[0-9]+/) && (substr($0,RSTART,RLENGTH) in a)
' names Input_file

Explanation: Adding detailed explanation for above.

awk '                                   ##Starting awk program from here.
FNR==NR{                                ##Checking condition FNR==NR which will be TRUE when first Input_file is being read.
  a[$0]                                 ##Creating array a with index of current line here.
  next                                  ##next will skip all further statement from here.
}
match($0,/Gene[0-9]+/) && (substr($0,RSTART,RLENGTH) in a)    ##Matching regex of Gene digits(which are coing together) AND making sure this substring(which is actually matched value by regex) is present in names file array then print current line.
' names  Input_file                     ##Mentioning Input_file names here.

Upvotes: 2

Related Questions