Reputation: 31

Filtering text file based on values in some lines using awk

I am dealing with a text file and each record in the file is separated by blank line. I want to extract the records which meats certain criteria.

For example, my text file looks like this

#EVM predictionEVM prediction: Mode:STANDARD S-ratio: 2.52 11043-11477 orient(-) score(1246.00)
11477   11043   single- 4   6   {SNAP_model.scaffold6_size143996-snap.2;SNAP

#EVM prediction: Mode:STANDARD S-ratio: 1.00 20968-21183 orient(+) score(432.00)
20968   21183   single+ 1   3   {GeneID_mRNA_scaffold6_size143996_6;GeneID}

#EVM prediction: Mode:STANDARD S-ratio: 1.00 21940-22362 orient(-) score(846.00)
22362   21940   single- 4   6   {GeneID_mRNA_scaffold6_size143996_7;GeneID}

#EVM prediction: Mode:STANDARD S-ratio: 12.32 33363-34677 orient(+) score(21500.00)
33363   33495   initial+    1   1   {SNAP_model.scaffold6_size143996-snap.3;SNAP},{GeneID_mRNA_scaffold6_size143996_10;GeneID},{Augustus_model.g38.t1;Augustus}
33496   33611   INTRON          {SNAP_model.scaffold6_size143996-snap.3;SNAP},{GeneID_mRNA_scaffold6_size143996_10;GeneID},{Augustus_model.g38.t1;Augustus},{ev_type:GeMoMa/ID=model.scaffold6_size143996.rna-XM_007036272.2_R0;GeMoMa}
33612   33741   internal+   2   2   {SNAP_model.scaffold6_size143996-snap.3;SNAP},{GeneID_mRNA_scaffold6_size143996_10;GeneID},{Augustus_model.g38.t1;Augustus},{ev_type:GeMoMa/ID=model.scaffold6_size143996.rna-XM_007036272.2_R0;GeMoMa}
33742   33842   INTRON          {SNAP_model.scaffold6_size143996-snap.3;SNAP},{GeneID_mRNA_scaffold6_size143996_10;GeneID},{Augustus_model.g38.t1;Augustus},{ev_type:GeMoMa/ID=model.scaffold6_size143996.rna-XM_007036272.2_R0;GeMoMa}
33843   34677   terminal+   3   3   {SNAP_model.scaffold6_size143996-snap.3;SNAP},{GeneID_mRNA_scaffold6_size143996_10;GeneID},{Augustus_model.g38.t1;Augustus}

#EVM prediction: Mode:STANDARD S-ratio: 2.41 46394-48564 orient(-) score(9677.00) noncoding_equivalent(4012.03) raw_noncoding(7194.39) offset(3182.36) 
46879   46394   terminal-   4   6   {GeneID_mRNA_scaffold6_size143996_13;GeneID}
47512   46880   INTRON          {GeneID_mRNA_scaffold6_size143996_13;GeneID}
48256   47513   internal-   4   6   {GeneID_mRNA_scaffold6_size143996_13;GeneID}
48366   48257   INTRON          {Augustus_model.g41.t1;Augustus}
48429   48367   internal-   4   6   {Augustus_model.g41.t1;Augustus}
48510   48430   INTRON          {Augustus_model.g41.t1;Augustus}
48564   48511   initial-    4   6   {Augustus_model.g41.t1;Augustus}

Now, I want to extract the records with score greater 1000. I want to remove second and third record which has sccore-432 score(432.00)and score-846 score(846.00)

I have written awk code

awk -F '[()]' '{if ($4 > 1000) print $0}' input.out

but it is giving only first line as output. i.e

#EVM predictionEVM prediction: Mode:STANDARD S-ratio: 2.52 11043-11477 orient(-) score(1246.00)
#EVM prediction: Mode:STANDARD S-ratio: 12.32 33363-34677 orient(+) score(21500.00)
#EVM prediction: Mode:STANDARD S-ratio: 2.41 46394-48564 orient(-) score(9677.00) noncoding_equivalent(4012.03) raw_noncoding(7194.39) offset(3182.36)

But I want to extract complete record corresponding to the score greater than 1000. Please help to extract complete record

Upvotes: 3

Answers (4)

William Pursell

Reputation: 212248

awk '{split($8,k,"[()]")} k[2]> 1000' RS= ORS='\n\n' input

When RS is set to the empty string, awk will treat a blank line as the record separater. We set ORS to two newlines so that blank lines are retained in the output. This solution simply splits the 8th field on the parentheses. According to the problem description, that field is expected to be a string of the form score(N) (if that condition on the input is not met, some error checking should be added.) By splitting on that filed, we get N in k[2], so we simply check if that value is greater than 1000. When it is, the default rule to print the record is applied.

Upvotes: 0

SLePort

Reputation: 15461

With GNU sed:

sed -E '/score\([1-9][0-9]{3,}/, /^ *$/!d' file

Upvotes: 0

RavinderSingh13

Reputation: 133528

EDIT: As per OP's comment, adding edited code here.

awk '
/^#EVM/{ found=="" }
match($0,/score\([0-9]+\.[0-9]+\)/){
  found=1
  val=substr($0,RSTART,RLENGTH)
  gsub(/.*\(|)$/,"",val)
  if(val+0>1000){ print; next }
}
found
' Input_file

Could you please try following. Written and tested with shown samples in GNU awk.

awk '
match($0,/score\([0-9]+\.[0-9]+\)/){
  val=substr($0,RSTART,RLENGTH)
  gsub(/.*\(|)$/,"",val)
  if(val+0>1000){ print }
}' Input_file

Explanation: Adding detailed explanation for above.

awk '                                  ##Starting awk program from here.
match($0,/score\([0-9]+\.[0-9]+\)/){   ##Using match function to match regex of (digits DOT digits ) in current line.
  val=substr($0,RSTART,RLENGTH)        ##Creating sub string of matched regex above and storing it to val here.
  gsub(/.*\(|)$/,"",val)               ##Globally substituting everything till ( and ) at last of line with NULL in val here.
  if(val+0>1000){ print }              ##If val is greater than 1000 then print line.
}' Input_file                          ##Mentioning Input_file name here.

Upvotes: 1

anubhava

Reputation: 785196

You may use this awk with an empty RS and match function:

awk -v RS= 'match($0, /score\([^)]+\)/) && substr($0, RSTART+6, RLENGTH-7)+0 > 1000 {ORS = RT; print}' file

#EVM predictionEVM prediction: Mode:STANDARD S-ratio: 2.52 11043-11477 orient(-) score(1246.00)
11477   11043   single- 4   6   {SNAP_model.scaffold6_size143996-snap.2;SNAP

#EVM prediction: Mode:STANDARD S-ratio: 12.32 33363-34677 orient(+) score(21500.00)
33363   33495   initial+    1   1   {SNAP_model.scaffold6_size143996-snap.3;SNAP},{GeneID_mRNA_scaffold6_size143996_10;GeneID},{Augustus_model.g38.t1;Augustus}
33496   33611   INTRON          {SNAP_model.scaffold6_size143996-snap.3;SNAP},{GeneID_mRNA_scaffold6_size143996_10;GeneID},{Augustus_model.g38.t1;Augustus},{ev_type:GeMoMa/ID=model.scaffold6_size143996.rna-XM_007036272.2_R0;GeMoMa}
33612   33741   internal+   2   2   {SNAP_model.scaffold6_size143996-snap.3;SNAP},{GeneID_mRNA_scaffold6_size143996_10;GeneID},{Augustus_model.g38.t1;Augustus},{ev_type:GeMoMa/ID=model.scaffold6_size143996.rna-XM_007036272.2_R0;GeMoMa}
33742   33842   INTRON          {SNAP_model.scaffold6_size143996-snap.3;SNAP},{GeneID_mRNA_scaffold6_size143996_10;GeneID},{Augustus_model.g38.t1;Augustus},{ev_type:GeMoMa/ID=model.scaffold6_size143996.rna-XM_007036272.2_R0;GeMoMa}
33843   34677   terminal+   3   3   {SNAP_model.scaffold6_size143996-snap.3;SNAP},{GeneID_mRNA_scaffold6_size143996_10;GeneID},{Augustus_model.g38.t1;Augustus}

#EVM prediction: Mode:STANDARD S-ratio: 2.41 46394-48564 orient(-) score(9677.00) noncoding_equivalent(4012.03) raw_noncoding(7194.39) offset(3182.36)
46879   46394   terminal-   4   6   {GeneID_mRNA_scaffold6_size143996_13;GeneID}
47512   46880   INTRON          {GeneID_mRNA_scaffold6_size143996_13;GeneID}
48256   47513   internal-   4   6   {GeneID_mRNA_scaffold6_size143996_13;GeneID}
48366   48257   INTRON          {Augustus_model.g41.t1;Augustus}
48429   48367   internal-   4   6   {Augustus_model.g41.t1;Augustus}
48510   48430   INTRON          {Augustus_model.g41.t1;Augustus}
48564   48511   initial-    4   6   {Augustus_model.g41.t1;Augustus}

A more readable version:

awk -v RS= '
match($0, /score\([^)]+\)/) && substr($0, RSTART+6, RLENGTH-7)+0 > 1000 {
   ORS = RT
   print
}' file

Upvotes: 1

Filtering text file based on values in some lines using awk

Answers (4)

Related Questions