Reputation: 413
I have a file like this:
419 I 0.3529
420 S 0.3182
421 T 0.3740
422 Y 0.3872
423 I 0.3460
424 E 0.4409
425 S 0.3182
426 T 0.3740
427 Y 0.4141
428 I 0.3460
429 S 0.3131
430 Y 0.3838
431 T 0.3939
432 S 0.3101
and I am trying to write an Awk program that checks whether the number in the third column is greater than or equal to 0.4. If it is, I want the letter (second column) from that line together with the letters from the 4 lines above it and the 4 lines below it. If there are multiple matches, I want one fixed-length string for each:
STYIESTYI
IESTYISYT
The first one comes because there is a match on the line numbered 424; the second is a (partially overlapping) match for the line numbered 427. How would I approach this?
Upvotes: 0
Views: 158
Reputation: 67467
awk to the rescue!
$ awk '{a[NR]=$2; v[NR]=($3>=0.4)}
       v[NR-4]{for(i=NR-8;i<=NR;i++)
                 printf "%s", a[i];
               print ""}' file
STYIESTYI
IESTYISYT
If your file is big, a rolling window might be a better solution.
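A rolling window could look roughly like the sketch below. It keeps only the last 2*ctx+1 letters in a circular buffer instead of the whole file. The shell function name `windows`, the variables `tgt`, `ctx`, `size`, `buf` and `hit`, and the helper `emit` are all illustrative names, not part of the scripts in this thread. It reads from standard input and truncates the context at the start and end of the input.

```shell
# Rolling-window sketch (hypothetical names throughout).
# Usage: windows [threshold] [context] < file
windows() {
  awk -v tgt="${1:-0.4}" -v ctx="${2:-4}" '
    BEGIN { size = 2 * ctx + 1 }        # number of lines kept in memory
    {
      buf[NR % size] = $2               # circular buffer of letters
      hit[NR % size] = ($3 >= tgt)      # 1 if this line is a match
      c = NR - ctx                      # line that now has ctx lines after it
      if (c >= 1 && hit[c % size]) emit(c, NR)
    }
    END {
      # Matches in the last ctx lines never got ctx trailing lines above.
      for (c = (NR - ctx + 1 > 1 ? NR - ctx + 1 : 1); c <= NR; c++)
        if (hit[c % size]) emit(c, NR)
    }
    # Print the letters from line c-ctx (or line 1) through line last.
    function emit(c, last,    i, s) {
      for (i = (c - ctx > 1 ? c - ctx : 1); i <= last; i++)
        s = s buf[i % size]
      print s
    }
  '
}
```

With the sample input from the question, `windows < file` should print STYIESTYI and IESTYISYT.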
UPDATE: As per @Ed Morton's comment, the script above assumes that there are always 4 trailing lines after a match. If that is not the case, a special END block needs to be added to handle the dangling lines.
$ awk '{a[NR]=$2; v[NR]=($3>=0.4)}
       v[NR-4]{for(i=NR-8;i<=NR;i++)
                 printf "%s", a[i];
               print ""}
       END{for(i=NR-3;i<=NR;i++)
             if(v[i]){
               for(j=i-4;j<=NR;j++)
                 printf "%s", a[j];
               print ""}}' file
Upvotes: 3
Reputation: 203219
$ cat tst.awk
BEGIN {
    tgt = (tgt=="" ? 0.4 : tgt)
    cxt = (cxt=="" ? 4 : cxt)
    bef = (bef=="" ? cxt : bef)
    aft = (aft=="" ? cxt : aft)
}
$3 >= tgt { hits[++numHits] = NR }
{ chars[NR] = $2 }
END {
    for (hitNr=1; hitNr<=numHits; hitNr++) {
        for (lineNr=(hits[hitNr]-bef); lineNr<=(hits[hitNr]+aft); lineNr++) {
            printf "%s", (lineNr in chars ? chars[lineNr] : "")
        }
        print ""
    }
}
$ awk -f tst.awk file
STYIESTYI
IESTYISYT
Note that this behaves sensibly when the line whose 3rd field is >= 0.4 is fewer than 4 lines from the start and/or end of the file. Make sure to test those conditions with any potential answer: they are common rainy-day cases for this type of problem, and people providing solutions often forget to cover them.
For example, try all potential solutions with this input file and see if you get the output you expect:
$ cat file1
421 T 0.3740
422 Y 0.3872
423 I 0.3460
424 E 0.4409
425 S 0.3182
426 T 0.3740
427 Y 0.4141
428 I 0.3460
429 S 0.3131
430 Y 0.3838
$ awk -f tst.awk file1
TYIESTYI
IESTYISY
If you instead get missing output lines, lines with leading or trailing blanks, other undesirable characters, or anything else unexpected, the solution does not handle these cases.
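To exercise the boundary cases systematically, a small generator like the sketch below may help. It writes an n-line input in the same three-column format whose only value >= 0.4 sits on a chosen line. The function name `edge_input` and the values 0.5 and 0.1 are made up for illustration.

```shell
# Hypothetical helper: print n lines "number letter value" where only
# line number $2 has a value at or above the 0.4 threshold.
# Usage: edge_input n hit_line
edge_input() {
  awk -v n="$1" -v hit="$2" 'BEGIN {
    letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    for (i = 1; i <= n; i++)
      printf "%d %s %.4f\n", i, substr(letters, (i - 1) % 26 + 1, 1), (i == hit ? 0.5 : 0.1)
  }'
}
```

For instance, `edge_input 6 1 | awk -f tst.awk` should print ABCDE and `edge_input 6 6 | awk -f tst.awk` should print BCDEF, since tst.awk truncates the context at the file boundaries.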
Note also that you can change the target value from 0.4 to something else, and you can change the number of context lines to print before and/or after the matched line, just by setting command-line arguments, e.g.
To print 5 lines of context before and after each line whose value is >= 0.37:
$ awk -v tgt=0.37 -v cxt=5 -f tst.awk file
ISTYIEST
ISTYIESTY
ISTYIESTYIS
TYIESTYISYT
YIESTYISYTS
STYISYTS
TYISYTS
To print 1 line before and 2 lines after each line whose value is >= 0.34:
$ awk -v tgt=0.34 -v bef=1 -v aft=2 -f tst.awk file
IST
STYI
TYIE
YIES
IEST
STYI
TYIS
YISY
SYTS
YTS
Upvotes: 4
Reputation: 2689
How about this? Keep track of the last four characters and print them on a match. Then set num to count off the next four characters. Note the use of printf rather than print to avoid an automatic newline.
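# Caveat: windows that overlap or follow each other within four lines
# are not printed as separate full-length strings.
{
    if ($3 >= 0.4) {              # match: flush the four remembered letters
        printf "%s", v0 v1 v2 v3
        v0 = v1 = v2 = v3 = ""
        num = 5                   # the matched letter plus the next four
    }
    if (num > 0) {
        printf "%s", $2
        if (--num == 0) print ""  # window complete: end the line
    } else {
        v0 = v1; v1 = v2; v2 = v3; v3 = $2
    }
}
END { if (num > 0) print "" }     # terminate a window cut short by end of file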
Upvotes: 1