Roy Banerjee

Reputation: 61

Number of consecutive values above a certain cut-off

I am new to Bash and Linux programming, and I have a small problem.

For a given cut-off c, I want to write a file containing the values above c, but only where two consecutive values are above c. For example:

x y
1 0.34
2 0.3432
3 0.32
4 0.35
5 0.323
6 0.3623
7 0.345

With c=0.33, it should print these values from column 2:

0.34
0.3432
0.3623
0.345

It should not print 0.35, even though it is above the cut-off of 0.33, because the next value, 0.323, is not. That pair fails the condition 'two consecutive values are above c'.

Upvotes: 0

Views: 199

Answers (4)

dawg

Reputation: 103783

The way you use a Bash parameter in awk is like so:

$ c=2.3
$ awk -v c="$c" 'BEGIN{print c}'
2.3
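As a small aside, here is a sketch of how -v differs from the trailing operand form (c=0.33 file) that other answers to this question use. The variable names with_v, with_operand and in_rule are only for illustration.

```shell
c=2.3

# -v assigns c before the BEGIN block runs.
with_v=$(awk -v c="$c" 'BEGIN { print "c=" c }')

# An operand assignment is applied only when awk reaches it in the operand
# list, which happens after BEGIN, so c is still empty there.
with_operand=$(awk 'BEGIN { print "c=" c }' c="$c")

# By the time the main rules run on the following input ("-" is stdin), c is set.
in_rule=$(echo 'x' | awk '{ print "c=" c }' c="$c" -)

echo "$with_v"        # c=2.3
echo "$with_operand"  # c=
echo "$in_rule"       # c=2.3
```

Either form works for the scripts here, since none of them read c inside BEGIN; -v is simply the safer habit.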

You can then use that to write your script like so:

c=0.33
m=2
awk -v c="$c" -v m="$m" '($2+0!=$2) {next}
                   $2+0<c {cnt=0; split("",lst); next}
                   $2+0>=c && cnt<m {lst[++cnt]=$2}
                   $2+0>=c && cnt==m {for (i=1; i<=m; i++) print lst[i]
                                    cnt=0; split("",lst)}' file

That will not print overlapping ranges such as:

1 0.34
2 0.3432     # prints 0.34\n0.3432\n here
3 0.35       # unclear if it should print 0.3432\n0.35\n  here....

Given the update to the question, the version below prints whole contiguous runs of values instead.

Given:

$ cat file
x y
1 0.34
2 0.3432
2a 0.35
3 0.32
4 0.35
5 0.323
6 0.3623
7 0.345

You can do:

c=0.33
m=2
awk -v c="$c" -v m="$m" '($2+0!=$2) {next}
             $2+0>=c {lst[++cnt]=$2; next}
             $2+0<c { if (cnt>=m) for (i=1; i<=cnt; i++) print lst[i]
                      cnt=0; split("",lst); next}
             END{if (cnt>=m) for (i=1; i<=cnt; i++) print lst[i]}' file

Prints:

0.34
0.3432
0.35
0.3623
0.345

Upvotes: 0

kvantour

Reputation: 26471

Original Question: print all sequences where 2 or more consecutive values satisfy a given condition

The following should work:

awk 'NR==1 {next}    # skip the header so "y" is never compared with c
     p || (prev>c && $2>c) {print prev}
     { p = (prev>c && $2>c); prev=$2 }
     END{if(p) print prev }' c=0.33 file

It uses the following logic:

  • p records whether the previous value has already been confirmed as part of a run. If it has, it is printed.
  • Otherwise (p==0), the previous value is printed only if it and the current value are both above c (prev>c && $2>c).
  • p is then computed for the next line, and prev is set to the current value.
  • At the end, if p==1, the last value is printed.

You essentially always run one line behind.
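The one-line-behind idea can be spelled out more verbosely. This is only a sketch: the function name print_runs and the helper variables cur_ok and prev_ok are made up for illustration, and it assumes a single header line followed by "x y" rows on stdin.

```shell
# print_runs CUTOFF: print column-2 values that belong to a run of two or
# more consecutive values above CUTOFF.
print_runs() {
  awk -v c="$1" '
    NR == 1 { next }                  # skip the header line
    {
      cur_ok = ($2 + 0 > c)           # is the current value above the cutoff?
      # The previous value is printed if it was already confirmed (p), or if
      # it and the current value are both above the cutoff.
      if (p || (prev_ok && cur_ok)) print prev
      p = (prev_ok && cur_ok)         # is the current value confirmed?
      prev = $2; prev_ok = cur_ok     # step forward by one line
    }
    END { if (p) print prev }         # the last value is still pending
  '
}

printf 'x y\n1 0.34\n2 0.3432\n3 0.32\n4 0.35\n5 0.323\n6 0.3623\n7 0.345\n' |
  print_runs 0.33
```

On the sample from the question this prints 0.34, 0.3432, 0.3623 and 0.345, one per line.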

Another approach is to store each value that satisfies the condition in an array and, on reaching a value that does not, process the array. This is a bit more memory intensive:

awk '(NR==1){next}
     ($2>c) { a[NR]=$2; next }
     (length(a) == 1) { delete a[NR-1]; next }
     { for(i=NR-length(a);i<NR;++i) {print a[i]; delete a[i]} }
     END { if (length(a)>1) for(i=NR+1-length(a);i<=NR;++i) {print a[i]} }
    ' c=0.33 <file>

Second question: print the subset of consecutive values of $2 for which m or more values satisfy a condition cond and at most n consecutive values do not satisfy cond. The sequence starts and ends with a value satisfying cond.

The following awk script will do this. Don't forget to adjust the values m, n and c to your wishes and update the conditional function.

function cond(val) { return val > c }
BEGIN{c=0.33; m=2; n=1}
# skip the header
(NR==1){next}
# if no values satisfy cond ...
(M==0 && !cond($2)) { next }
# ... otherwise continue from here
{ a[NR]=$2 }
# set counters M and N (M satisfy cond, N not )
 cond($2) { M++; N=0 }
!cond($2) { N++ }
# This sequence failed, delete it
(N>n && M<m) { for(i in a) delete a[i]; M=0; N=0 }
# This sequence is OK, strip it and print it
(N>n) { j=NR; while (!cond(a[j])) delete a[j--]
        for (i=j+1-length(a);i<=j;++i) { print a[i]; delete a[i] }
        M=0; N=0 }
# Check if the final stored sequence is successful
END { if (M>=m) { 
         j=NR; while (!cond(a[j])) delete a[j--]
         for (i=j+1-length(a);i<=j;++i) print a[i]
      }
    }

Upvotes: 1

choroba

Reputation: 241828

Perl solution:

c=.33 m=2 perl -lane '
if ($F[1] > $ENV{c}) { push @r, $F[1] }
else {
    if (@r >= $ENV{m}) { print for @r }
    @r = ();
}
END { if (@r >= $ENV{m}) { print for @r } }' -- file

It stores consecutive values above the threshold in the array @r. When the current value is not above the threshold, it prints the array if it is long enough and then empties it.

  • -l removes newlines from input and adds them to output
  • -n reads the input line by line
  • -a autosplits each line into the @F array
  • an array used in numeric context returns its size
  • the %ENV hash contains the environmental variables

If the sequences tend to be very long, you can keep at most m elements in the array to save some memory, printing older values as soon as the run is known to be long enough:

c=.33 m=2 perl -lane '
if ($F[1] > $ENV{c}) {
    push @r, $F[1];
    print shift @r if @r > $ENV{m};
} else {
    if (@r >= $ENV{m}) { print for @r }
    @r = ();
}
END { if (@r >= $ENV{m}) { print for @r } }' -- file
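The same bounded-memory idea can be sketched in awk. This is a variant, not a translation of the Perl above: it holds at most m-1 values until a run reaches length m, then prints each further value immediately. The function name stream_runs is hypothetical, and a single header line is assumed.

```shell
# stream_runs CUTOFF M: print column-2 values that belong to a run of at
# least M consecutive values above CUTOFF, buffering fewer than M values.
stream_runs() {
  awk -v c="$1" -v m="$2" '
    NR == 1 { next }                  # skip the header line
    $2 + 0 > c {
      n++                             # length of the current run
      if (n < m) {
        buf[n] = $2                   # run not long enough yet: hold the value
      } else if (n == m) {
        for (i = 1; i < m; i++) print buf[i]   # run confirmed: flush held values
        print $2
      } else {
        print $2                      # run already confirmed: print at once
      }
      next
    }
    { n = 0 }                         # a value at or below the cutoff ends the run
  '
}

printf 'x y\n1 0.34\n2 0.3432\n3 0.32\n4 0.35\n5 0.323\n6 0.3623\n7 0.345\n' |
  stream_runs 0.33 2
```

Nothing needs to be done at end of input, because every value in a confirmed run has already been printed by then.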

Upvotes: 0

oliv

Reputation: 13249

You could use this awk script:

awk -v cutoff="0.33" '
  NR==1{next}   # skip the header; "y" would compare as a string and pass
  $2>cutoff{
    if(prev) 
      {print prev ORS $2 } 
    else 
      {prev=$2;next}
  }
  {prev=""}' file

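It stores a value above the cutoff in the prev variable. If the next value is also above the cutoff, it prints both. Either way, prev is reset after that next value, so values are reported in pairs.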
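One caveat for comparisons of the form $2>cutoff: a field that does not look like a number, such as the header "y", is compared as a string. The sketch below shows the difference that forcing a numeric comparison makes; the variable names plain and forced are only for illustration.

```shell
# "y" is not numeric, so awk compares it with "0.33" as strings, and "y"
# sorts after "0", which makes the plain comparison true.
plain=$(echo 'x y' | awk -v cutoff=0.33 '{ print ($2 > cutoff) }')

# Adding 0 forces a numeric comparison: "y" becomes 0, which is not above 0.33.
forced=$(echo 'x y' | awk -v cutoff=0.33 '{ print ($2 + 0 > cutoff) }')

echo "$plain"   # 1
echo "$forced"  # 0
```

This is why a header line should either be skipped or have its field forced to a number before the test.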

Upvotes: 0
