Jalan

Reputation: 97

extract rows when column value greater than specified and also incremented at least five times

With the following input,

08    V   3.8     0.0   23.456  60.459  60.459
09    M   4.4     0.0   24.960  72.301  72.301
10    L   4.4     0.0   25.301  95.197  95.197
11    L   1.9     0.0   25.410  99.173  99.173
12    L   1.7     0.0   25.484  99.862  99.862
104   V   7.1     0.0   0.374   5.225   5.225
105   L   0.7     0.0   0.374   5.119   5.119
169   V   4.6     0.1   0.000   31.658  1.658
170   S   5.7     0.0   0.000   32.117  1.117
171   S   5.7     0.0   0.000   32.117  5.001
260   Y   4.8     0.0   0.342   54.178  54.178
261   S   4.1     0.0   0.144   67.833  67.833
262   I   8.4     0.0   0.000   87.300  87.300
263   I   9.5     0.0   0.000   88.950  88.950
264   I  11.3     0.1   0.000   89.070  89.070

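Output rows that match the following two conditions:

  • the value in the 6th column is greater than 5
  • the value in the 1st column increments by one for at least five consecutive rows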

So the output for the above input should be as follows. The rows that start with 104 and 105 are excluded because these are only two consecutive numbers, while the rows that start with 169 and 170 are excluded because the 6th column value is less than 5.

08    V   3.8     0.0   23.456  60.459  60.459
09    M   4.4     0.0   24.960  72.301  72.301
10    L   4.4     0.0   25.301  95.197  95.197
11    L   1.9     0.0   25.410  99.173  99.173
12    L   1.7     0.0   25.484  99.862  99.862
260   Y   4.8     0.0   0.342   54.178  54.178
261   S   4.1     0.0   0.144   67.833  67.833
262   I   8.4     0.0   0.000   87.300  87.300
263   I   9.5     0.0   0.000   88.950  88.950
264   I  11.3     0.1   0.000   89.070  89.070

The first part is straightforward with the following awk one-liner:

LC_ALL=C awk '$6>5 {print}' input

But I am stumped by the second condition. I would really appreciate any help with it.

Upvotes: 1

Views: 121

Answers (3)

jhnc

Reputation: 16819

I'm not sure I understand the criterion but this code gives the correct output with your test data.

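LC_ALL=C awk '
    # print the buffered run if it has at least 5 lines and every $6 was > 5
    function maybeprint() {
        if (n>=5 && ok) print buf
    }
    {
        # the first line, or a $1 that breaks the sequence, starts a new run
        if (NR == 1 || $1 != p1+1) {
            maybeprint()
            buf = $0
            n = ok = 1
        } else {
            buf = buf RS $0
            ++n
        }
        p1 = $1
    }
    $6<=5 { ok = 0 }
    END { maybeprint() }
' input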
  • accumulate lines into a buffer and save some state (number of lines in buffer, and whether all $6 have been > 5).
  • when $1 is not one more than previous value, print buffer if appropriate state, then clear it and reset state
  • after end of input, print buffer if appropriate state
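
As a rough sketch of the same buffering idea with the two limits passed in as variables: the names min_run, threshold, flush_run and the file sample.txt are made up for this illustration and do not come from the question. The sample has a run of three consecutive IDs followed by a run of five, so only the second run should be printed.

```shell
# Hypothetical sample: IDs 1-3 (too short a run), then IDs 10-14 (long enough)
cat > sample.txt <<'EOF'
1 a 0 0 0 9.0
2 b 0 0 0 9.0
3 c 0 0 0 9.0
10 d 0 0 0 6.0
11 e 0 0 0 7.0
12 f 0 0 0 8.0
13 g 0 0 0 9.0
14 h 0 0 0 9.5
EOF

result=$(LC_ALL=C awk -v min_run=5 -v threshold=5 '
    # print the buffered run only if it is long enough and every $6 passed
    function flush_run() { if (n >= min_run && ok) print buf }
    {
        if (NR == 1 || $1 != prev + 1) {   # a new run starts here
            flush_run()
            buf = $0; n = 1; ok = 1
        } else {                           # the current run continues
            buf = buf RS $0; n++
        }
        if ($6 <= threshold) ok = 0        # one failing row rejects the whole run
        prev = $1
    }
    END { flush_run() }                    # the last run is still in the buffer
' sample.txt)

printf '%s\n' "$result"
```

Passing the limits with -v keeps the program text unchanged if the required run length or the cut-off for column 6 changes.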

Upvotes: 4

Fravadona

Reputation: 17216

UPDATE

OP's definition of "consecutive" is a little different from what I understood at first:

Let n0 be the value of $1 at iteration 0,
v0 be the value of $6 at iteration 0,
n1 be the value of $1 at iteration 1,
and v1 be the value of $6 at iteration 1.

Two consecutive lines must satisfy the three following conditions:

n0 + 1 == n1
v0 > 5
v1 > 5

If you need to output the lines that are part of a group of at least 5 consecutive lines, you can use something like this:

awk '
    {
        consecutive_count = ($6 > 5 ? ($1-1 == previous_1 ? consecutive_count+1 : 1) : 0)
        saved_lines[consecutive_count] = $0
        previous_1 = $1+0
    }
    consecutive_count == 5 {
        for (i = 1; i <= consecutive_count; i++)
            print saved_lines[i]
    }
    consecutive_count > 5
'

OLD ANSWER

As @Daweo said, you need to buffer some input lines because you can't determine how many consecutive ones there will be while examining the current line. Because your constraint is 5 consecutive lines, you can store just 5 of them in an array, using e.g. NR modulo 5 as the key.

Then you also need to determine the current number of "consecutive" lines; for that you'll have to "save" the current value of $1 and use it in the next line iteration.

Here you go:

awk '
    {
        buffered_lines[NR%5] = $0
        consecutives_lines = ($1-1 == previous_1 ? consecutives_lines + 1 : 1)
        previous_1 = $1+0
    }
    consecutives_lines == 5 {
        for (i = NR-4; i <= NR ; i++) {
            $0 = buffered_lines[i%5]
            if ($6+0 > 5)
                print
        }
    }
    consecutives_lines > 5 && $6+0 > 5
'

note: by setting $0 to buffered_lines[...] you can make awk redo the splitting for you and then access the 6th field of the "buffered line" as $6. You have to be careful when using this method as the current line will be lost. Here it isn't harmful as $0, $1, etc... are not used further down the code (and incidentally, the current line was also restored in $0 in the last iteration of the for loop).

08    V   3.8     0.0   23.456  60.459  60.459
09    M   4.4     0.0   24.960  72.301  72.301
10    L   4.4     0.0   25.301  95.197  95.197
11    L   1.9     0.0   25.410  99.173  99.173
12    L   1.7     0.0   25.484  99.862  99.862
260   Y   4.8     0.0   0.342   54.178  54.178
261   S   4.1     0.0   0.144   67.833  67.833
262   I   8.4     0.0   0.000   87.300  87.300
263   I   9.5     0.0   0.000   88.950  88.950
264   I  11.3     0.1   0.000   89.070  89.070
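
The note above about assigning to $0 can be tried in isolation. The following is only a small sketch on made-up two-line input: it re-splits a saved line by assigning it to $0, restores the current line, and also shows split() as an alternative that leaves $0 untouched. The variable names saved, current, from_reassign and f are chosen for this illustration.

```shell
result=$(printf '%s\n' 'a b c' 'd e f' | awk '
    NR == 1 { saved = $0 }          # remember the first line
    NR == 2 {
        current = $0                # keep a copy of the current line
        $0 = saved                  # awk re-splits: $1..$NF now describe the saved line
        from_reassign = $2
        $0 = current                # restore the current line and its fields
        split(saved, f)             # alternative: fields go into array f, $0 is untouched
        print from_reassign, f[2], $2
    }
')

printf '%s\n' "$result"   # prints: b b e
```

With split() there is nothing to restore afterwards, at the cost of addressing fields as f[2] instead of $2.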

Upvotes: 3

Daweo

Reputation: 36700

Really appreciate any help with it.

GNU AWK allows you to store values in an array. If you need a value from n lines earlier, it is handy to use NR (the number of the current record) as the key. Consider the following simple example. Let the content of file.txt be

0 Able
2 Baker
4 Charlie
8 Dog
16 Easy

then

awk '{arr[NR]=$1;print "current value",arr[NR],"value 1 line before",(NR-1 in arr?arr[NR-1]:"n/a"),"value 2 lines before",(NR-2 in arr?arr[NR-2]:"n/a")}' file.txt

gives output

current value 0 value 1 line before n/a value 2 lines before n/a
current value 2 value 1 line before 0 value 2 lines before n/a
current value 4 value 1 line before 2 value 2 lines before 0
current value 8 value 1 line before 4 value 2 lines before 2
current value 16 value 1 line before 8 value 2 lines before 4

Explanation: I store the 1st field values in the array arr. I use the so-called ternary operator to test whether a key is present in arr; if so, I print the corresponding value, otherwise n/a.

(tested in GNU Awk 5.3.1)
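
Keying the array by NR keeps every value for the whole run. For long inputs, one possible variation (a sketch only, with keep being a name introduced here for the lookback distance) is to delete the entry that has just fallen out of the window, so the array never holds more than keep+1 values. It reuses the file.txt content from above.

```shell
cat > file.txt <<'EOF'
0 Able
2 Baker
4 Charlie
8 Dog
16 Easy
EOF

result=$(awk -v keep=2 '
    {
        arr[NR] = $1
        delete arr[NR-keep-1]       # drop the entry that fell out of the window
        # value from "keep" lines earlier, or n/a near the start of the file
        print $1, (NR-keep in arr ? arr[NR-keep] : "n/a")
    }
' file.txt)

printf '%s\n' "$result"
```

Deleting an index that was never set is harmless in awk, so the first few lines need no special case.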

Upvotes: 2
