Reputation: 31
I am dealing with a text file in which each record is separated by a blank line. I want to extract the records which meet certain criteria.
For example, my text file looks like this:
#EVM predictionEVM prediction: Mode:STANDARD S-ratio: 2.52 11043-11477 orient(-) score(1246.00)
11477 11043 single- 4 6 {SNAP_model.scaffold6_size143996-snap.2;SNAP
#EVM prediction: Mode:STANDARD S-ratio: 1.00 20968-21183 orient(+) score(432.00)
20968 21183 single+ 1 3 {GeneID_mRNA_scaffold6_size143996_6;GeneID}
#EVM prediction: Mode:STANDARD S-ratio: 1.00 21940-22362 orient(-) score(846.00)
22362 21940 single- 4 6 {GeneID_mRNA_scaffold6_size143996_7;GeneID}
#EVM prediction: Mode:STANDARD S-ratio: 12.32 33363-34677 orient(+) score(21500.00)
33363 33495 initial+ 1 1 {SNAP_model.scaffold6_size143996-snap.3;SNAP},{GeneID_mRNA_scaffold6_size143996_10;GeneID},{Augustus_model.g38.t1;Augustus}
33496 33611 INTRON {SNAP_model.scaffold6_size143996-snap.3;SNAP},{GeneID_mRNA_scaffold6_size143996_10;GeneID},{Augustus_model.g38.t1;Augustus},{ev_type:GeMoMa/ID=model.scaffold6_size143996.rna-XM_007036272.2_R0;GeMoMa}
33612 33741 internal+ 2 2 {SNAP_model.scaffold6_size143996-snap.3;SNAP},{GeneID_mRNA_scaffold6_size143996_10;GeneID},{Augustus_model.g38.t1;Augustus},{ev_type:GeMoMa/ID=model.scaffold6_size143996.rna-XM_007036272.2_R0;GeMoMa}
33742 33842 INTRON {SNAP_model.scaffold6_size143996-snap.3;SNAP},{GeneID_mRNA_scaffold6_size143996_10;GeneID},{Augustus_model.g38.t1;Augustus},{ev_type:GeMoMa/ID=model.scaffold6_size143996.rna-XM_007036272.2_R0;GeMoMa}
33843 34677 terminal+ 3 3 {SNAP_model.scaffold6_size143996-snap.3;SNAP},{GeneID_mRNA_scaffold6_size143996_10;GeneID},{Augustus_model.g38.t1;Augustus}
#EVM prediction: Mode:STANDARD S-ratio: 2.41 46394-48564 orient(-) score(9677.00) noncoding_equivalent(4012.03) raw_noncoding(7194.39) offset(3182.36)
46879 46394 terminal- 4 6 {GeneID_mRNA_scaffold6_size143996_13;GeneID}
47512 46880 INTRON {GeneID_mRNA_scaffold6_size143996_13;GeneID}
48256 47513 internal- 4 6 {GeneID_mRNA_scaffold6_size143996_13;GeneID}
48366 48257 INTRON {Augustus_model.g41.t1;Augustus}
48429 48367 internal- 4 6 {Augustus_model.g41.t1;Augustus}
48510 48430 INTRON {Augustus_model.g41.t1;Augustus}
48564 48511 initial- 4 6 {Augustus_model.g41.t1;Augustus}
Now I want to extract the records with a score greater than 1000. That is, I want to remove the second and third records, which have score(432.00) and score(846.00).
I have written this awk code:
awk -F '[()]' '{if ($4 > 1000) print $0}' input.out
but it gives only the first line of each matching record as output, i.e.
#EVM predictionEVM prediction: Mode:STANDARD S-ratio: 2.52 11043-11477 orient(-) score(1246.00)
#EVM prediction: Mode:STANDARD S-ratio: 12.32 33363-34677 orient(+) score(21500.00)
#EVM prediction: Mode:STANDARD S-ratio: 2.41 46394-48564 orient(-) score(9677.00) noncoding_equivalent(4012.03) raw_noncoding(7194.39) offset(3182.36)
But I want to extract the complete record for every score greater than 1000. Please help me extract the complete records.
Upvotes: 3
Views: 169
Reputation: 212248
awk '{split($8,k,"[()]")} k[2]> 1000' RS= ORS='\n\n' input
When RS is set to the empty string, awk treats blank lines as the record separator. We set ORS to two newlines so that blank lines are retained in the output. This solution simply splits the 8th field on the parentheses. According to the problem description, that field is expected to be a string of the form score(N) (if that condition on the input is not met, some error checking should be added). By splitting on that field, we get N in k[2], so we simply check whether that value is greater than 1000. When it is, the default rule to print the record is applied.
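One way the error checking mentioned above might look is sketched here. It assumes the score always appears as a whitespace-delimited field of the form score(N) somewhere in the record, and it searches for that field instead of trusting position 8. The file name sample_records.txt and its contents are made up for illustration, as is the choice to report a record without a score on stderr and skip it.

```shell
# Hypothetical sample file (records separated by a blank line); not the OP's data.
cat > sample_records.txt <<'EOF'
#EVM prediction: Mode:STANDARD S-ratio: 2.52 11043-11477 orient(-) score(1246.00)
11477 11043 single- 4 6 {SNAP_model;SNAP}

#EVM prediction: Mode:STANDARD S-ratio: 1.00 20968-21183 orient(+) score(432.00)
20968 21183 single+ 1 3 {GeneID_model;GeneID}

#EVM prediction: Mode:STANDARD S-ratio: 1.00 30000-30100 orient(+)
30000 30100 single+ 1 3 {GeneID_model;GeneID}
EOF

# Paragraph mode (RS=): each blank-line-separated block is one record.
# Instead of trusting $8, look through the fields for one shaped like score(N).
checked=$(awk '
{
    score = ""
    for (i = 1; i <= NF; i++)
        if ($i ~ /^score\([0-9.]+\)$/) {
            split($i, k, "[()]")      # k[2] holds N
            score = k[2]
            break
        }
    if (score == "") {
        # No score field found: report the record on stderr and skip it.
        print "record " NR ": no score(N) field" > "/dev/stderr"
        next
    }
}
score + 0 > 1000
' RS= ORS='\n\n' sample_records.txt)

printf '%s\n' "$checked"
```

With this sample, only the first record should be printed; the third one has no score field and is reported on stderr rather than silently dropped.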
Upvotes: 0
Reputation: 133528
EDIT: As per OP's comment, adding edited code here.
awk '
/^#EVM/{ found=0 }
match($0,/score\([0-9]+\.[0-9]+\)/){
val=substr($0,RSTART,RLENGTH)
gsub(/.*\(|\)$/,"",val)
if(val+0>1000){ found=1 }
}
found
' Input_file
Could you please try the following. Written and tested with the shown samples in GNU awk.
awk '
match($0,/score\([0-9]+\.[0-9]+\)/){
val=substr($0,RSTART,RLENGTH)
gsub(/.*\(|\)$/,"",val)
if(val+0>1000){ print }
}' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
match($0,/score\([0-9]+\.[0-9]+\)/){ ##Using match function to find score(digits DOT digits) in the current line.
val=substr($0,RSTART,RLENGTH) ##Storing the matched text, e.g. score(1246.00), in val.
gsub(/.*\(|\)$/,"",val) ##Removing everything up to ( and the trailing ) from val, leaving only the number.
if(val+0>1000){ print } ##If val is greater than 1000 then print the line.
}' Input_file ##Mentioning Input_file name here.
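To see the extraction step on its own, here is a small sketch that runs the same match, substr and gsub calls on a single header line taken from the question and prints the number that ends up in val. The shell variable names line and extracted are only for illustration.

```shell
# One header line from the question's sample data.
line='#EVM prediction: Mode:STANDARD S-ratio: 1.00 20968-21183 orient(+) score(432.00)'

extracted=$(printf '%s\n' "$line" | awk '
match($0, /score\([0-9]+\.[0-9]+\)/) {
    val = substr($0, RSTART, RLENGTH)   # the matched text: score(432.00)
    gsub(/.*\(|\)$/, "", val)           # drop "score(" and the trailing ")"
    print val
}')

printf '%s\n' "$extracted"   # prints 432.00
```

The closing parenthesis is escaped in the gsub regex here so that the pattern should also be accepted by awks other than GNU awk.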
Upvotes: 1
Reputation: 785196
You may use this awk with an empty RS and the match function (RT is specific to GNU awk):
awk -v RS= 'match($0, /score\([^)]+\)/) && substr($0, RSTART+6, RLENGTH-7)+0 > 1000 {ORS = RT; print}' file
#EVM predictionEVM prediction: Mode:STANDARD S-ratio: 2.52 11043-11477 orient(-) score(1246.00)
11477 11043 single- 4 6 {SNAP_model.scaffold6_size143996-snap.2;SNAP
#EVM prediction: Mode:STANDARD S-ratio: 12.32 33363-34677 orient(+) score(21500.00)
33363 33495 initial+ 1 1 {SNAP_model.scaffold6_size143996-snap.3;SNAP},{GeneID_mRNA_scaffold6_size143996_10;GeneID},{Augustus_model.g38.t1;Augustus}
33496 33611 INTRON {SNAP_model.scaffold6_size143996-snap.3;SNAP},{GeneID_mRNA_scaffold6_size143996_10;GeneID},{Augustus_model.g38.t1;Augustus},{ev_type:GeMoMa/ID=model.scaffold6_size143996.rna-XM_007036272.2_R0;GeMoMa}
33612 33741 internal+ 2 2 {SNAP_model.scaffold6_size143996-snap.3;SNAP},{GeneID_mRNA_scaffold6_size143996_10;GeneID},{Augustus_model.g38.t1;Augustus},{ev_type:GeMoMa/ID=model.scaffold6_size143996.rna-XM_007036272.2_R0;GeMoMa}
33742 33842 INTRON {SNAP_model.scaffold6_size143996-snap.3;SNAP},{GeneID_mRNA_scaffold6_size143996_10;GeneID},{Augustus_model.g38.t1;Augustus},{ev_type:GeMoMa/ID=model.scaffold6_size143996.rna-XM_007036272.2_R0;GeMoMa}
33843 34677 terminal+ 3 3 {SNAP_model.scaffold6_size143996-snap.3;SNAP},{GeneID_mRNA_scaffold6_size143996_10;GeneID},{Augustus_model.g38.t1;Augustus}
#EVM prediction: Mode:STANDARD S-ratio: 2.41 46394-48564 orient(-) score(9677.00) noncoding_equivalent(4012.03) raw_noncoding(7194.39) offset(3182.36)
46879 46394 terminal- 4 6 {GeneID_mRNA_scaffold6_size143996_13;GeneID}
47512 46880 INTRON {GeneID_mRNA_scaffold6_size143996_13;GeneID}
48256 47513 internal- 4 6 {GeneID_mRNA_scaffold6_size143996_13;GeneID}
48366 48257 INTRON {Augustus_model.g41.t1;Augustus}
48429 48367 internal- 4 6 {Augustus_model.g41.t1;Augustus}
48510 48430 INTRON {Augustus_model.g41.t1;Augustus}
48564 48511 initial- 4 6 {Augustus_model.g41.t1;Augustus}
A more readable version:
awk -v RS= '
match($0, /score\([^)]+\)/) && substr($0, RSTART+6, RLENGTH-7)+0 > 1000 {
ORS = RT
print
}' file
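Since RT is only available in GNU awk, a variant for other awks could set ORS to a fixed blank line instead. This is a sketch under the assumption that it is acceptable to normalise every record separator to exactly one blank line; the file name sample_scores.txt and its contents are invented for the example.

```shell
# Hypothetical sample file (records separated by a blank line); not the OP's data.
cat > sample_scores.txt <<'EOF'
#EVM prediction: Mode:STANDARD S-ratio: 2.52 11043-11477 orient(-) score(1246.00)
11477 11043 single- 4 6 {SNAP_model;SNAP}

#EVM prediction: Mode:STANDARD S-ratio: 1.00 20968-21183 orient(+) score(432.00)
20968 21183 single+ 1 3 {GeneID_model;GeneID}

#EVM prediction: Mode:STANDARD S-ratio: 12.32 33363-34677 orient(+) score(21500.00)
33363 33495 initial+ 1 1 {SNAP_model;SNAP}
33496 33611 INTRON {SNAP_model;SNAP}
EOF

# Same test as above, but ORS is a fixed "\n\n" rather than the gawk-only RT.
# "score(" is 6 characters long, so the number starts at RSTART+6 and is
# RLENGTH-7 characters long (the match minus "score(" and ")").
kept=$(awk -v RS= -v ORS='\n\n' '
match($0, /score\([^)]+\)/) && substr($0, RSTART + 6, RLENGTH - 7) + 0 > 1000
' sample_scores.txt)

printf '%s\n' "$kept"
```

With this sample, the first and third records should be kept in full, separated by one blank line, and the record with score(432.00) dropped.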
Upvotes: 1