Reputation: 845
I have a very long list of hits from an HMMER search in the following form:
Query: Alvin_0001|ID:9263667| [L=454]
Description: chromosomal replication initiator protein DnaA [Allochromatium vinosum DSM 180]
Scores for complete sequence (score includes all domains):
--- full sequence --- --- best 1 domain --- -#dom-
E-value score bias E-value score bias exp N Model Description
------- ------ ----- ------- ------ ----- ---- -- -------- -----------
7.5e-150 497.8 0.2 9e-150 497.5 0.2 1.0 1 COG0593
8e-11 40.6 0.5 1.5e-10 39.7 0.5 1.6 1 COG1484
4.5e-07 28.1 0.2 6e-07 27.7 0.2 1.1 1 COG1373
2.5e-05 22.3 0.1 3.4e-05 21.8 0.1 1.4 1 COG1485
Query: Alvin_0005|ID:9265207| [L=334]
Description: hypothetical protein [Allochromatium vinosum DSM 180]
Scores for complete sequence (score includes all domains):
--- full sequence --- --- best 1 domain --- -#dom-
E-value score bias E-value score bias exp N Model Description
------- ------ ----- ------- ------ ----- ---- -- -------- -----------
------ inclusion threshold ------
0.018 13.4 12.9 0.068 11.5 3.6 2.2 2 COG3247
0.024 13.1 9.0 0.053 12.0 9.0 1.5 1 COG2246
0.046 12.4 7.3 0.049 12.4 5.3 1.8 1 COG2020
Query: Alvin_0004|ID:9265206| [L=154]
Description: hypothetical protein [Allochromatium vinosum DSM 180]
Scores for complete sequence (score includes all domains):
--- full sequence --- --- best 1 domain --- -#dom-
E-value score bias E-value score bias exp N Model Description
------- ------ ----- ------- ------ ----- ---- -- -------- -----------
[No hits detected that satisfy reporting thresholds]
This file contains a lot of information that I am not interested in, so I need a script that outputs only certain values: the line starting with Query: and the first COG#### in the Model column.
The expected output (ideally a tab-delimited file) would be:
Query: Alvin_0001|ID:9263667| [L=454] COG0593
Query: Alvin_0005|ID:9265207| [L=334] COG3247
Query: Alvin_0004|ID:9265206| [L=154]
Note that in the last line no COG was found.
The file structure is too irregular for me to handle with a simple grep or awk command based on fixed line offsets:
In the first block, the targets are the Query line and the 6th line after it (awk '/Query: /{nr[NR]; nr[NR+6]}; NR in nr').
In the second block, they are the Query line and the 7th line after it, because of the extra "inclusion threshold" line.
In the third block, there is only the Query line.
So what would be a good approach to parse this file?
Upvotes: 0
Views: 86
Reputation: 92854
Short awk solution:
awk '/^Query:/{ if(q) print q; q=$0 }q && $9~/^COG.{4}$/{ printf("%s\t%s\n",q,$9); q="" }
END{ if(q) print q }' file
The output:
Query: Alvin_0001|ID:9263667| [L=454] COG0593
Query: Alvin_0005|ID:9265207| [L=334] COG3247
Query: Alvin_0004|ID:9265206| [L=154]
Details:
/^Query:/{ q=$0 }
- capturing "Query" line
q && $9~/^COG.{4}$/
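- capturing the first "Model" field value of each block (only the first, because q is reset with q="" once it has been printed)
END{ if(q) print q }
- printing the last "Query" line if it had no hit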
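The same logic can be spelled out as a commented script. This is only a sketch: it assumes that every hit line carries the model name in field 9 and that model names are COG followed by digits. The wrapper name first_cog is made up for illustration.

```shell
# first_cog is a hypothetical wrapper name; it reads HMMER text output on stdin.
first_cog() {
  awk '
    # New query block: flush the previous query if it never got a hit.
    /^Query:/ {
      if (q != "") print q
      q = $0
      next
    }
    # First hit line of the current block: field 9 is the Model column.
    q != "" && $9 ~ /^COG[0-9]+$/ {
      printf "%s\t%s\n", q, $9
      q = ""   # ignore the remaining hits of this block
    }
    # The last query in the file had no hit.
    END { if (q != "") print q }
  '
}
```

It would be called as first_cog < file > out.tsv. Matching COG[0-9]+ instead of .{4} avoids relying on interval expressions, which some older awk versions do not support by default.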
Upvotes: 2
Reputation: 203189
$ cat tst.awk
BEGIN { OFS="\t" }
/^Query/ { qry=$0 }
$1 ~ /^[0-9]/ { if (qry!="") print qry, $9; qry="" }
/\[No hits/ { print qry }
$ awk -f tst.awk file
Query: Alvin_0001|ID:9263667| [L=454] COG0593
Query: Alvin_0005|ID:9265207| [L=334] COG3247
Query: Alvin_0004|ID:9265206| [L=154]
Upvotes: 1