rororo

Reputation: 845

Return values of specific fields of a file (bash)

I have a very long list of hits from an HMMER search in the following form:

Query:       Alvin_0001|ID:9263667|  [L=454]
Description: chromosomal replication initiator protein DnaA [Allochromatium vinosum DSM 180]
Scores for complete sequence (score includes all domains):
   --- full sequence ---   --- best 1 domain ---    -#dom-
    E-value  score  bias    E-value  score  bias    exp  N  Model    Description
    ------- ------ -----    ------- ------ -----   ---- --  -------- -----------
   7.5e-150  497.8   0.2     9e-150  497.5   0.2    1.0  1  COG0593
      8e-11   40.6   0.5    1.5e-10   39.7   0.5    1.6  1  COG1484
    4.5e-07   28.1   0.2      6e-07   27.7   0.2    1.1  1  COG1373
    2.5e-05   22.3   0.1    3.4e-05   21.8   0.1    1.4  1  COG1485

Query:       Alvin_0005|ID:9265207|  [L=334]
Description: hypothetical protein [Allochromatium vinosum DSM 180]
Scores for complete sequence (score includes all domains):
   --- full sequence ---   --- best 1 domain ---    -#dom-
    E-value  score  bias    E-value  score  bias    exp  N  Model    Description
    ------- ------ -----    ------- ------ -----   ---- --  -------- -----------
  ------ inclusion threshold ------
      0.018   13.4  12.9      0.068   11.5   3.6    2.2  2  COG3247
      0.024   13.1   9.0      0.053   12.0   9.0    1.5  1  COG2246
      0.046   12.4   7.3      0.049   12.4   5.3    1.8  1  COG2020

Query:       Alvin_0004|ID:9265206|  [L=154]
Description: hypothetical protein [Allochromatium vinosum DSM 180]
Scores for complete sequence (score includes all domains):
   --- full sequence ---   --- best 1 domain ---    -#dom-
    E-value  score  bias    E-value  score  bias    exp  N  Model    Description
    ------- ------ -----    ------- ------ -----   ---- --  -------- -----------

   [No hits detected that satisfy reporting thresholds]

This file contains a lot of information I am not interested in, so I need a script that outputs only certain values: the line starting with Query: and the first COG#### entry in the Model column.

The expected output (ideally a tab-delimited file) would be:

Query:       Alvin_0001|ID:9263667|  [L=454]    COG0593
Query:       Alvin_0005|ID:9265207|  [L=334]    COG3247
Query:       Alvin_0004|ID:9265206|  [L=154]    

Note that in the last line no COG was found.

The file structure is a bit too complicated for me to handle with a simple grep or awk command. In the first block, the targets are the Query line and the line 6 lines below it (awk '/Query: /{nr[NR]; nr[NR+6]}; NR in nr'). In the second block, the extra "inclusion threshold" line shifts the target to 7 lines below the Query line, and in the third block there is only the Query line.

So what would be a good approach to parsing this file?

Upvotes: 0

Views: 86

Answers (2)

RomanPerekhrest

Reputation: 92854

Short awk solution:

awk '/^Query:/{ if(q) print q; q=$0 }q && $9~/^COG.{4}$/{ printf("%s\t%s\n",q,$9); q="" }
     END{ if(q) print q }' file

The output:

Query:       Alvin_0001|ID:9263667|  [L=454]    COG0593
Query:       Alvin_0005|ID:9265207|  [L=334]    COG3247
Query:       Alvin_0004|ID:9265206|  [L=154]

Details:

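  • /^Query:/{ if(q) print q; q=$0 } - captures each "Query" line in q; if the previous "Query" line is still pending (no COG was found for it), it is printed on its own first

  • q && $9~/^COG.{4}$/ - matches the first "Model" field value after a "Query" line; only the first one is printed because q is reset (q="") right after printing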
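
A variant of the same idea, as a minimal sketch only. It wraps the awk program in a shell function (the name first_cog is made up for illustration), matches the model with /^COG[0-9]+$/ instead of an interval expression, since some older awk implementations do not support {4}, and always emits the tab, so queries without a hit still get an empty second column as in the expected output of the question. It assumes, as in the sample, that the Model name is the 9th whitespace-separated field of a hit row.

```shell
# Hypothetical helper: print every Query line followed by a tab and the
# first COG model listed in its block (empty if the block has no hits).
first_cog() {
  awk '
    # New block: flush the previous Query if it never got a COG.
    /^Query:/ { if (q != "") print q "\t"; q = $0; next }
    # First hit row after a Query: field 9 is the Model column.
    q != "" && $9 ~ /^COG[0-9]+$/ { print q "\t" $9; q = "" }
    # The last block may also end without a hit.
    END { if (q != "") print q "\t" }
  ' "$@"
}
```

It can be called as first_cog file > hits.tsv.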

Upvotes: 2

Ed Morton

Reputation: 203189

$ cat tst.awk
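BEGIN { OFS="\t" }                                     # tab-separated output
/^Query/ { qry=$0 }                                    # remember the current Query line
$1 ~ /^[0-9]/ { if (qry!="") print qry, $9; qry="" }   # first hit row: print Query + Model, then reset
/\[No hits/   { print qry }                            # no hits: print the Query line alone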

$ awk -f tst.awk file
Query:       Alvin_0001|ID:9263667|  [L=454]    COG0593
Query:       Alvin_0005|ID:9265207|  [L=334]    COG3247
Query:       Alvin_0004|ID:9265206|  [L=154]

Upvotes: 1
