Reputation: 121
I'm trying to filter one file based on two columns of another one.
The problem is that awk
is not differentiating, for example, this interval 70083 83083, from position 7323573 (please see below).
The aim is to retrieve the value for file 1 that is in the column 5 of file 2.
File 1 has only one position in the column 3 ex: 51476, and the file 2 has an interval represented by column 3 and 4.
In the end I need the file 1 with respective values of the column 5 (see output).
file 1
rs187298206 chr1 51476 0.0072 0.201426626822702
rs116400033 chr1 51479 0.2055 1.18445621536109
rs62637813 chr1 52058 0.0587 0.551216300225955
rs190291950 chr1 52144 -4e-04 0.036575951491895
rs150021059 chr1 52238 0.3325 1.70427928591544
rs140052487 chr1 54353 0.003 0.12778378962414
rs146477069 chr1 54421 0.1419 0.924336309646664
rs141149254 chr1 54490 0.1767 1.06786868821145
rs2462492 chr1 54676 0.0819 0.664355314594874
rs143174675 chr1 54753 0.026 0.356836206987615
rs3091274 chr1 55164 0.3548 1.80091078751368
rs10399749 chr1 55299 0.0309 0.389748348495465
rs182462964 chr1 55313 2e-04 0.0877969207975495
rs3107975 chr1 55326 0.0237 0.344080010917931
rs142800240 chr1 7323573 -6e-04 0.0361473609720785
file 2
51083_1 chr1 51083 56000 -0.177152387075888 0.172569306719619
57083_1 chr1 57083 60083 -0.0524335467819781 0.130497858911419
60083_1 chr1 70083 83083 -0.0332555672564894 0.124932838766226
525083_1 chr1 525083 528083 0.291406335374442 0.0577249392691202
528083_1 chr1 528083 531083 0.291406335374442 0.0577249392691202
531083_1 chr1 531083 534083 0.291406335374442 0.0577249392691202
534083_1 chr1 534083 537083 0.291406335374442 0.0577249392691202
534083_1 chr1 534083 537083 0.441406335374442 0.0577249392691202
What I get with this script:
awk '
NR == FNR {score[$3] = $1 FS $2 FS $3 FS $4; next}
{
for (key in score)
if (key > $3 && key < $4)
print score[key], $5
}
' file1 file2 > output
output
rs140052487 chr1 54353 0.003 -0.177152387075888
rs150021059 chr1 52238 0.3325 -0.177152387075888
rs3107975 chr1 55326 0.0237 -0.177152387075888
rs3091274 chr1 55164 0.3548 -0.177152387075888
rs187298206 chr1 51476 0.0072 -0.177152387075888
rs116400033 chr1 51479 0.2055 -0.177152387075888
rs10399749 chr1 55299 0.0309 -0.177152387075888
rs146477069 chr1 54421 0.1419 -0.177152387075888
rs190291950 chr1 52144 -4e-04 -0.177152387075888
rs182462964 chr1 55313 2e-04 -0.177152387075888
rs141149254 chr1 54490 0.1767 -0.177152387075888
rs62637813 chr1 52058 0.0587 -0.177152387075888
rs143174675 chr1 54753 0.026 -0.177152387075888
rs2462492 chr1 54676 0.0819 -0.177152387075888
rs142800240 chr1 7323573 -6e-04 -0.0332555672564894 <- this should not appear
Upvotes: 1
Views: 1185
Reputation: 754550
I can reproduce your problem on Mac OS X 10.11.3 with the system's BSD awk
.
The problem is to do with string vs number comparison; awk
appears to be treating the key
as a string and is doing a string comparison rather than a numerical comparison.
I've brute-forced it into treating the comparison numerically with:
awk '
NR == FNR {score[$3] = $1 FS $2 FS $3 FS $4; next}
{
for (key in score)
{
if (key+0 > $3+0 && key+0 < $4+0)
{
#print "==", key, $3, $4
#if (key > $3) print key, ">", $3
#if (key < $4) print key, "<", $4
print score[key], $5
}
}
}
' file1 file2
You can see the '+0
' to force awk
to treat things as numbers. (The analogue to force awk
to treat a value as a string is, for example, key ""
, which concatenates an empty string to the (string) value of key
.)
With your sample data, I then get the output:
rs140052487 chr1 54353 0.003 -0.177152387075888
rs150021059 chr1 52238 0.3325 -0.177152387075888
rs3107975 chr1 55326 0.0237 -0.177152387075888
rs3091274 chr1 55164 0.3548 -0.177152387075888
rs187298206 chr1 51476 0.0072 -0.177152387075888
rs116400033 chr1 51479 0.2055 -0.177152387075888
rs10399749 chr1 55299 0.0309 -0.177152387075888
rs146477069 chr1 54421 0.1419 -0.177152387075888
rs190291950 chr1 52144 -4e-04 -0.177152387075888
rs182462964 chr1 55313 2e-04 -0.177152387075888
rs141149254 chr1 54490 0.1767 -0.177152387075888
rs62637813 chr1 52058 0.0587 -0.177152387075888
rs143174675 chr1 54753 0.026 -0.177152387075888
rs2462492 chr1 54676 0.0819 -0.177152387075888
Part of the debugging output, which gave the game away, was:
== 54676 51083 56000
54676 > 51083
54676 < 56000
rs2462492 chr1 54676 0.0819 -0.177152387075888
== 7323573 70083 83083
7323573 > 70083
7323573 < 83083
rs142800240 chr1 7323573 -6e-04 -0.0332555672564894
For the 5-digit strings, the comparison happened to work the same as a numeric comparison. For the other, it did not. I should also point out that the $3+0
and $4+0
parts are probably not essential. I had those when I got the debugging output shown — but the tests only started to work when I added 0
to the key
. I probably don't need to add the 0
to $3
or $4
, therefore.
Upvotes: 1
Reputation: 694
awk '
NR == FNR {score[$3] = $1 FS $2 FS $3 FS $4; next}
{
for (key in score)
if (key+0 > $3 && key+0 < $4)
print score[key], $5
}
' fst.txt tajima.txt > output
gives me
[/tmp]$ cat output
rs182462964 chr1 55313 2e-04 -0.177152387075888
rs190291950 chr1 52144 -4e-04 -0.177152387075888
rs62637813 chr1 52058 0.0587 -0.177152387075888
rs146477069 chr1 54421 0.1419 -0.177152387075888
rs140052487 chr1 54353 0.003 -0.177152387075888
rs3107975 chr1 55326 0.0237 -0.177152387075888
rs187298206 chr1 51476 0.0072 -0.177152387075888
rs141149254 chr1 54490 0.1767 -0.177152387075888
rs10399749 chr1 55299 0.0309 -0.177152387075888
rs3091274 chr1 55164 0.3548 -0.177152387075888
rs143174675 chr1 54753 0.026 -0.177152387075888
rs2462492 chr1 54676 0.0819 -0.177152387075888
rs150021059 chr1 52238 0.3325 -0.177152387075888
rs116400033 chr1 51479 0.2055 -0.177152387075888
to force the interpretation as a number, add 0 to it. from the man page for awk.
Upvotes: 3