Reputation: 413

How to use if/else awk to evaluate a file and extract this information?

I have a file like this:

419 I     0.3529
420 S     0.3182
421 T     0.3740
422 Y     0.3872
423 I     0.3460
424 E     0.4409
425 S     0.3182
426 T     0.3740
427 Y     0.4141
428 I     0.3460
429 S     0.3131
430 Y     0.3838
431 T     0.3939
432 S     0.3101

and I am trying to write an Awk program that checks the third column for values greater than or equal to 0.4. For each match, I want the letter on that line (second column) together with the 4 letters above it and the 4 below it. If there are multiple matches, I want one fixed-length string for each:

STYIESTYI
IESTYISYT

The first one comes because there is a match on the line numbered 424; the second is a (partially overlapping) match for the line numbered 427. How would I approach this?

Upvotes: 0

Answers (3)

karakfa

Reputation: 67467

awk to the rescue!

$ awk    '{a[NR]=$2; v[NR]=($3>=0.4)}
   v[NR-4]{for(i=NR-8;i<=NR;i++)
              printf "%s", a[i];
           print ""}' file

STYIESTYI
IESTYISYT

If your file is big, a rolling window might be a better solution.
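A rolling window could look roughly like the sketch below. It is only one possible way to do it, not necessarily what was meant above: it keeps the last 2*cxt+1 lines in a circular buffer indexed by NR modulo the window size, so memory use does not grow with the file. The names rolling_context, emit, ch, hit, size and the file sample.txt are all made up for illustration.

```shell
# Hypothetical helper: rolling_context FILE [CONTEXT] [THRESHOLD]
rolling_context() {
    awk -v cxt="${2:-4}" -v tgt="${3:-0.4}" '
    # Print the letters from line center-cxt through line last, skipping
    # line numbers that fall before the start of the file.
    function emit(center, last,    i, s) {
        for (i = center - cxt; i <= last; i++)
            if (i >= 1) s = s ch[i % size]
        print s
    }
    BEGIN { size = 2 * cxt + 1 }          # only this many lines are ever stored
    {
        ch[NR % size]  = $2               # overwrite the oldest slot
        hit[NR % size] = ($3 >= tgt)
        c = NR - cxt                      # this line now has full trailing context
        if (c >= 1 && hit[c % size]) emit(c, NR)
    }
    END {
        # Matches in the last cxt lines never reached full trailing context.
        for (c = NR - cxt + 1; c <= NR; c++)
            if (c >= 1 && hit[c % size]) emit(c, NR)
    }' "$1"
}

# Sample input from the question, written to a hypothetical file name.
cat > sample.txt <<'EOF'
419 I     0.3529
420 S     0.3182
421 T     0.3740
422 Y     0.3872
423 I     0.3460
424 E     0.4409
425 S     0.3182
426 T     0.3740
427 Y     0.4141
428 I     0.3460
429 S     0.3131
430 Y     0.3838
431 T     0.3939
432 S     0.3101
EOF

rolling_context sample.txt
```

On the sample input this prints STYIESTYI and IESTYISYT. Because the END rule flushes matches near the end of the file, no separate fix-up is needed for dangling lines in this variant.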

UPDATE: As per @Ed Morton's comment, the script assumes that there are always 4 trailing lines after a match. If there are not, a special END block needs to be added to handle the dangling lines.

$ awk    '{a[NR]=$2; v[NR]=($3>=0.4)}
   v[NR-4]{for(i=NR-8;i<=NR;i++)
              printf "%s", a[i];
           print ""}
      END{for(i=NR-3;i<=NR;i++)
             if(v[i]) {
                 for(j=i-4;j<=NR;j++)
                    printf "%s", a[j];
                 print ""}}' file

Upvotes: 3

Ed Morton

Reputation: 203219

$ cat tst.awk
BEGIN {
    tgt = (tgt=="" ? 0.4 : tgt)
    cxt = (cxt=="" ?  4  : cxt)
    bef = (bef=="" ? cxt : bef)
    aft = (aft=="" ? cxt : aft)
}
$3 >= tgt { hits[++numHits] = NR }
{ chars[NR] = $2 }
END {
    for (hitNr=1; hitNr<=numHits; hitNr++) {
        for (lineNr=(hits[hitNr]-bef); lineNr<=(hits[hitNr]+aft); lineNr++) {
            printf "%s", (lineNr in chars ? chars[lineNr] : "")
        }
        print ""
    }
}

$ awk -f tst.awk file
STYIESTYI
IESTYISYT

Note that this will behave sensibly if the line with the 3rd field >= 0.4 is closer than 4 lines to the start and/or end of the file. Make sure to test those conditions with any potential answer, as they are common rainy-day cases for this type of problem that people providing solutions often forget to cover.

For example, try all potential solutions with this input file and see if you get the output you expect:

$ cat file1
421 T     0.3740
422 Y     0.3872
423 I     0.3460
424 E     0.4409
425 S     0.3182
426 T     0.3740
427 Y     0.4141
428 I     0.3460
429 S     0.3131
430 Y     0.3838

$ awk -f tst.awk file1
TYIESTYI
IESTYISY

or if you get missing output lines or lines with leading/trailing blanks or other undesirable chars or something else.
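As a rough sketch of how those rainy-day cases might be checked, the snippet below feeds three tiny made-up inputs (match on the first line, match on the last line, no match) through a compact inline variant of the END-based approach. The function name context and the inputs are invented for illustration; swap in whichever solution you want to test.

```shell
# Hypothetical stand-in for the solution under test; reads from stdin.
context() {
    awk -v tgt=0.4 -v cxt=4 '
        $3 >= tgt { hits[++n] = NR }      # remember where each match is
        { chars[NR] = $2 }                # remember every letter
        END {
            for (h = 1; h <= n; h++) {
                s = ""
                for (i = hits[h] - cxt; i <= hits[h] + cxt; i++)
                    if (i in chars) s = s chars[i]   # skip lines outside the file
                print s
            }
        }'
}

# Match on the very first line: there is no leading context.
first=$(printf '1 A 0.9\n2 B 0.1\n3 C 0.1\n4 D 0.1\n5 E 0.1\n6 F 0.1\n' | context)
# Match on the very last line: there is no trailing context.
last=$(printf '1 A 0.1\n2 B 0.1\n3 C 0.1\n4 D 0.1\n5 E 0.1\n6 F 0.9\n' | context)
# No match at all: nothing should be printed.
none=$(printf '1 A 0.1\n2 B 0.1\n' | context)

# Brackets make stray blanks or missing output easy to spot.
printf 'first=[%s] last=[%s] none=[%s]\n' "$first" "$last" "$none"
```

With these inputs the line printed is first=[ABCDE] last=[BCDEF] none=[]. Anything else, such as padding blanks or a missing string, points to an unhandled boundary case.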

Note also that you can change the target value from 0.4 to something else, and you can change the number of context lines to print before and/or after the matched line, just by setting command line args, e.g.

To print 5 lines of context before and after each value >= 0.37:

$ awk -v tgt=0.37 -v cxt=5 -f tst.awk file
ISTYIEST
ISTYIESTY
ISTYIESTYIS
TYIESTYISYT
YIESTYISYTS
STYISYTS
TYISYTS

To print 1 line before and 2 lines after each value >= 0.34:

$ awk -v tgt=0.34 -v bef=1 -v aft=2 -f tst.awk file
IST
STYI
TYIE
YIES
IEST
STYI
TYIS
YISY
SYTS
YTS

Upvotes: 4

Neil Masson

Reputation: 2689

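How is this? Keep track of the last four characters and print them on a match. Then set num to count off the matched character and the next four. Note the use of printf rather than print to avoid an automatic newline. This assumes matches are at least five lines apart; overlapping matches, as in the sample, need an approach that stores the characters in an array.

{
    if ($3 >= 0.4) {
        printf "%s", v0 v1 v2 v3    # the four characters before the match
        num = 5                     # the matched character plus the next four
    }
    if (num > 0) {
        printf "%s", $2
        if (--num == 0) print ""    # string complete: end the line
    }
    v0 = v1; v1 = v2; v2 = v3; v3 = $2    # always remember the last four
}
END { if (num > 0) print "" }       # terminate a string cut short by end of file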

Upvotes: 1
