Reputation: 1
I have a tab-delimited file with the lines as:
chr1 1001 + KEEP 0.5 0.3 0.06 0.4 0.2 0.3 0.5
chr1 40004 + KEEP 0.93 0.2 0.6 0.25 0.8 NA NA
chr2 140004 + KEEP 0.03 0.02 0.06 0.05 0.08 NA 0.01
chr2 455235504 - KEEP NA 0.12 0.67 0.51 0.8 NA NA
chr3 5004 + KEEP 0.53 0.52 NA 0.5 NA 0.54 NA
..
I want to calculate the difference between the max and min on each line from the 5th column onwards, and extract the rows in which this (max-min) difference is equal or more than 0.1. So for example with the input given above, I should get:
chr1 1001 + KEEP 0.5 0.3 0.06 0.4 0.2 0.3 0.5
chr1 40004 + KEEP 0.93 0.2 0.6 0.25 0.8 NA NA
chr2 455235504 - KEEP NA 0.12 0.67 0.51 0.8 NA NA
..
How can I use awk to do this for about a few million lines per file?
Currently I'm trying out
awk '{min=max=$5;
for(i=5;i<=67;i++){
if($i<min) min=$i;
if($i>max) max=$i
}
print min,max}' test.txt
Upvotes: 0
Views: 87
Reputation: 203324
$ cat tst.awk
{
min = max = ""
for (i=5; i<=NF; i++) {
if ($i == $i+0) {
min = ( (min == "") || ($i < min) ? $i : min)
max = ( (max == "") || ($i > max) ? $i : max)
}
}
}
(max - min) >= 0.1
$ awk -f tst.awk file
chr1 1001 + KEEP 0.5 0.3 0.06 0.4 0.2 0.3 0.5
chr1 40004 + KEEP 0.93 0.2 0.6 0.25 0.8 NA NA
chr2 455235504 - KEEP NA 0.12 0.67 0.51 0.8 NA NA
The test for $i == $i+0
will only be true when $i
is a number so it will discard cases where $i
is "NA"
or any other non-numeric value.
As an enhancement, you could easily tweak it to report lines that have no numeric values if you like:
{
min = max = ""
for (i=5; i<=NF; i++) {
if ($i == $i+0) {
min = ( (min == "") || ($i < min) ? $i : min)
max = ( (max == "") || ($i > max) ? $i : max)
}
}
}
min == "" { printf "ERROR[%d]: \"%s\" has no numeric values.\n", NR, $0 | "cat>&2" }
(max - min) >= 0.1
Upvotes: 0
Reputation:
Another awk way
awk '{min=max=x;for(i=5;i<=NF;i++){min=min!~/./||$i<min?$i:min;max=$i+0>max?$i:max}}
max-min>=.1' file
This sets min and max to 0 at the start of each line.
Then loops from 5 to the number of fields in the line.
Then if min is not set or is less than $i set it to $i.
If max is more than $i, set it to $i
The $i+0 is to remove the "NA" which will then be set to 0.
Print lines where max minus min is more than or equal to .1.
This answer assumes max will always be above 0.
If max can be less than 0
awk '{min=max=x
for(i=5;i<=NF;i++){min=min!~/./||$i<min?$i:min
$i!="NA"&&max=max!~/./||$i>max?$i:max}
} max-min>=.1' file
Upvotes: 0
Reputation: 249133
Here's a Python solution which has no extra dependencies and should run virtually anywhere:
import fileinput
for line in fileinput.input():
strings = line.rstrip().split(' ')[4:]
numbers = [float(string) for string in strings if string != 'NA']
if max(numbers) - min(numbers) >= 0.1:
print line,
Upvotes: 0
Reputation: 780869
Mainly you're just missing the test for whether the difference between min
and max
is more than your threshold. You should also use NF
as the limit in the for
loop, not hard-code a number like 67
.
awk '{min=max=$5;
for(i=6;i<=NF;i++){
if ($i == "NA") continue;
if (min == "NA" || $i<min) min=$i;
if (max == "NA" || $i>max) max=$i
}
if ((max - min) > .1) print}' test.txt
I also added a check for NA
, it skips those values.
Upvotes: 1