justaguy

Reputation: 3022

awk to use header field to count fields

I am trying to use awk to count the headers and use those as field numbers. The awk below is close, but I need some expert help to make it better. Thank you :).

My problem is twofold:

  1. The awk as is ignores the field headers and numbers the fields by whatever text is present, so field 5 sometimes starts with NM_ and other times with LRG_, as RefSeqGene.txt illustrates. I think that is because not all the fields have text, but what is consistent are the headers.

  2. I only want to pull the row where $10 = "reference standard"

awk

awk 'FNR==NR {E[$1]; next }$3 in E {print $3, $5}' panel_genes.txt     RefSeqGene.txt > update.txt

example of panel_genes.txt (used to search RefSeqGene.txt)

ACTA2
BRAF
BHLHB9

example of RefSeqGene.txt

#tax_id GeneID  Symbol  RSG LRG RNA t   Protein p   Category
9606    59  ACTA2   NG_011541.1     NM_001613.2     NP_001604.1     reference standard
9606    59  ACTA2   NG_011541.1     NM_001141945.1      NP_001135417.1      reference standard
9606    673 BRAF    NG_007873.3 LRG_299 NM_004333.4 t1  NP_004324.2 p1  reference standard
9606    80823   BHLHB9  NG_021340.1     NM_001142524.1      NP_001135996.1      aligned
9606    80823   BHLHB9  NG_021340.1     NM_001142525.1      NP_001135997.1      aligned
9606    80823   BHLHB9  NG_021340.1     NM_001142526.1      NP_001135998.1      aligned

desired output

ACTA2     NM_001613.2   
ACTA2     NM_001141945.1
BRAF      NM_004333.4

Upvotes: 0

Views: 54

Answers (1)

Kent

Reputation: 195079

This one-liner gives you the desired output:

 awk 'FNR==NR{a[$0];next}
     $(NF-1)$NF=="referencestandard" && $3 in a{print $3, ($5~/^NM_/?$5:$6)}' file1 file2
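  • $(NF-1)$NF=="referencestandard" checks your $10: with the default whitespace splitting, "reference standard" lands in the last two fields, so they are concatenated and compared
  • if $5 begins with NM_ we take it; otherwise we take $6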
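
If RefSeqGene.txt is actually tab-delimited (an assumption; the sample in the question may just have lost its tabs when pasted), the original idea of keying off the headers can be sketched as below. Splitting on tabs keeps empty columns in place, so the header names map to stable field numbers. The `col` and `genes` arrays and the small sample files written by `printf` are illustrative only, and the header names are assumed to be exactly those shown in the question.

```shell
# Hypothetical sample input; the tab-delimited layout is an assumption.
printf 'ACTA2\nBRAF\nBHLHB9\n' > panel_genes.txt
printf '#tax_id\tGeneID\tSymbol\tRSG\tLRG\tRNA\tt\tProtein\tp\tCategory\n' > RefSeqGene.txt
printf '9606\t59\tACTA2\tNG_011541.1\t\tNM_001613.2\t\tNP_001604.1\t\treference standard\n' >> RefSeqGene.txt
printf '9606\t673\tBRAF\tNG_007873.3\tLRG_299\tNM_004333.4\tt1\tNP_004324.2\tp1\treference standard\n' >> RefSeqGene.txt
printf '9606\t80823\tBHLHB9\tNG_021340.1\t\tNM_001142524.1\t\tNP_001135996.1\t\taligned\n' >> RefSeqGene.txt

awk -F'\t' '
    # First file: remember every gene symbol in the panel.
    FNR == NR { genes[$1]; next }

    # Header line of the second file: map each column name to its number.
    FNR == 1 {
        sub(/^#/, "", $1)                      # drop the leading "#" from "#tax_id"
        for (i = 1; i <= NF; i++) col[$i] = i
        next
    }

    # Data lines: keep panel genes whose Category is "reference standard".
    ($(col["Symbol"]) in genes) && $(col["Category"]) == "reference standard" {
        print $(col["Symbol"]), $(col["RNA"])
    }
' panel_genes.txt RefSeqGene.txt > update.txt

cat update.txt
```

Because the columns are looked up by name, the same script keeps working whether or not the LRG column is filled in. It does rely on panel_genes.txt being non-empty, since `FNR == NR` is what tells the two files apart.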

Upvotes: 2
