Turtle
Turtle

Reputation: 1

Extract rows if max-min in that row fulfill certain conditions

I have a tab-delimited file with the lines as:

chr1 1001 + KEEP 0.5 0.3 0.06 0.4 0.2 0.3 0.5
chr1 40004 + KEEP 0.93 0.2 0.6 0.25 0.8 NA NA
chr2 140004 + KEEP 0.03 0.02 0.06 0.05 0.08 NA 0.01
chr2 455235504 - KEEP NA 0.12 0.67 0.51 0.8 NA NA
chr3 5004 + KEEP 0.53 0.52 NA 0.5 NA 0.54 NA
..

I want to calculate the difference between the max and min on each line from the 5th column onwards, and extract the rows in which this (max-min) difference is equal or more than 0.1. So for example with the input given above, I should get:

chr1 1001 + KEEP 0.5 0.3 0.06 0.4 0.2 0.3 0.5
chr1 40004 + KEEP 0.93 0.2 0.6 0.25 0.8 NA NA
chr2 455235504 - KEEP NA 0.12 0.67 0.51 0.8 NA NA
..

How can I use awk to do this for about a few million lines per file?

Currently I'm trying out

awk '{min=max=$5; 
      for(i=5;i<=67;i++){
        if($i<min) min=$i;
        if($i>max) max=$i
      }
      print min,max}' test.txt 

Upvotes: 0

Views: 87

Answers (4)

Ed Morton
Ed Morton

Reputation: 203324

$ cat tst.awk
{
    min = max = ""
    for (i=5; i<=NF; i++) {
        if ($i == $i+0) {
            min = ( (min == "") || ($i < min) ? $i : min)
            max = ( (max == "") || ($i > max) ? $i : max)
        }
    }
}
(max - min) >= 0.1

$ awk -f tst.awk file
chr1 1001 + KEEP 0.5 0.3 0.06 0.4 0.2 0.3 0.5
chr1 40004 + KEEP 0.93 0.2 0.6 0.25 0.8 NA NA
chr2 455235504 - KEEP NA 0.12 0.67 0.51 0.8 NA NA

The test for $i == $i+0 will only be true when $i is a number so it will discard cases where $i is "NA" or any other non-numeric value.

As an enhancement, you could easily tweak it to report lines that have no numeric values if you like:

{
    min = max = ""
    for (i=5; i<=NF; i++) {
        if ($i == $i+0) {
            min = ( (min == "") || ($i < min) ? $i : min)
            max = ( (max == "") || ($i > max) ? $i : max)
        }
    }
}
min == "" { printf "ERROR[%d]: \"%s\" has no numeric values.\n", NR, $0 | "cat>&2" }
(max - min) >= 0.1

Upvotes: 0

user3442743
user3442743

Reputation:

Another awk way

awk '{min=max=x;for(i=5;i<=NF;i++){min=min!~/./||$i<min?$i:min;max=$i+0>max?$i:max}}
      max-min>=.1' file

This sets min and max to 0 at the start of each line.
Then loops from 5 to the number of fields in the line.
Then if min is not set or is less than $i set it to $i.
If max is more than $i, set it to $i
The $i+0 is to remove the "NA" which will then be set to 0.
Print lines where max minus min is more than or equal to .1.

This answer assumes max will always be above 0.

If max can be less than 0

     awk '{min=max=x
     for(i=5;i<=NF;i++){min=min!~/./||$i<min?$i:min
     $i!="NA"&&max=max!~/./||$i>max?$i:max}
     } max-min>=.1' file

Upvotes: 0

John Zwinck
John Zwinck

Reputation: 249133

Here's a Python solution which has no extra dependencies and should run virtually anywhere:

import fileinput

for line in fileinput.input():
    strings = line.rstrip().split(' ')[4:]
    numbers = [float(string) for string in strings if string != 'NA']
    if max(numbers) - min(numbers) >= 0.1:
        print line,

Upvotes: 0

Barmar
Barmar

Reputation: 780869

Mainly you're just missing the test for whether the difference between min and max is more than your threshold. You should also use NF as the limit in the for loop, not hard-code a number like 67.

awk '{min=max=$5; 
      for(i=6;i<=NF;i++){
        if ($i == "NA") continue;
        if (min == "NA" || $i<min) min=$i;
        if (max == "NA" || $i>max) max=$i
      }
      if ((max - min) > .1) print}' test.txt 

I also added a check for NA, it skips those values.

Upvotes: 1

Related Questions