Reputation: 3022
I am trying to use awk to count the headers and use those as field numbers. My problem is twofold:
The awk below is close, but I need some expert help to make it better. Thank you :).
The awk as written ignores the field headers and defines the fields using the text (sometimes field 5 starts with NM_, other times with LRG_), as RefSeqGene.txt illustrates. I think that is because not all the fields have text, but what is consistent are the headers.
I only want to pull the rows where $10 (Category) = "reference standard".
awk
awk 'FNR==NR {E[$1]; next }$3 in E {print $3, $5}' panel_genes.txt RefSeqGene.txt > update.txt
example of panel_genes.txt (used to search RefSeqGene.txt)
ACTA2
BRAF
BHLHB9
example of RefSeqGene.txt
#tax_id GeneID Symbol RSG LRG RNA t Protein p Category
9606 59 ACTA2 NG_011541.1 NM_001613.2 NP_001604.1 reference standard
9606 59 ACTA2 NG_011541.1 NM_001141945.1 NP_001135417.1 reference standard
9606 673 BRAF NG_007873.3 LRG_299 NM_004333.4 t1 NP_004324.2 p1 reference standard
9606 80823 BHLHB9 NG_021340.1 NM_001142524.1 NP_001135996.1 aligned
9606 80823 BHLHB9 NG_021340.1 NM_001142525.1 NP_001135997.1 aligned
9606 80823 BHLHB9 NG_021340.1 NM_001142526.1 NP_001135998.1 aligned
desired output
ACTA2 NM_001613.2
ACTA2 NM_001141945.1
BRAF NM_004333.4
Upvotes: 0
Views: 54
Reputation: 195079
This one-liner gives you the desired output:
awk 'FNR==NR{a[$0];next}
$(NF-1)$NF=="referencestandard" && $3 in a{print $3, ($5~/^NM_/?$5:$6)}' file1 file2
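$(NF-1)$NF=="referencestandard" checks your $10. Because "reference standard" contains a space, awk splits it into the last two fields, so the test concatenates those two fields and compares the result with "referencestandard".
If $5 begins with NM_ we take it; otherwise we take $6.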
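To see how the pieces fit together, here is a minimal sketch that recreates the sample files from the question and runs the same logic. The scratch directory, the array name panel, and the result variable are only illustrative, and the data is assumed to be whitespace-separated exactly as pasted in the question (the real RefSeqGene file may be laid out differently).

```shell
# Recreate the sample inputs from the question in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' ACTA2 BRAF BHLHB9 > panel_genes.txt

cat > RefSeqGene.txt <<'EOF'
#tax_id GeneID Symbol RSG LRG RNA t Protein p Category
9606 59 ACTA2 NG_011541.1 NM_001613.2 NP_001604.1 reference standard
9606 59 ACTA2 NG_011541.1 NM_001141945.1 NP_001135417.1 reference standard
9606 673 BRAF NG_007873.3 LRG_299 NM_004333.4 t1 NP_004324.2 p1 reference standard
9606 80823 BHLHB9 NG_021340.1 NM_001142524.1 NP_001135996.1 aligned
9606 80823 BHLHB9 NG_021340.1 NM_001142525.1 NP_001135997.1 aligned
9606 80823 BHLHB9 NG_021340.1 NM_001142526.1 NP_001135998.1 aligned
EOF

result=$(awk '
    # First file: remember every gene symbol in the panel.
    FNR == NR { panel[$1]; next }

    # Second file: keep rows whose last two fields spell "reference standard"
    # and whose symbol ($3) is in the panel.
    ($(NF-1) $NF) == "referencestandard" && ($3 in panel) {
        # Take $5 when it is the NM_ accession; otherwise assume it is in $6.
        print $3, ($5 ~ /^NM_/ ? $5 : $6)
    }
' panel_genes.txt RefSeqGene.txt)

printf '%s\n' "$result"
```

On the sample data this prints the three lines listed under "desired output" in the question. Checking the last two fields instead of $10 is what makes the test work even though rows with empty LRG, t, or p columns have fewer than ten whitespace-separated fields.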
Upvotes: 2