Evaluate whether a column value is within a range of 2 other columns and print with awk

Question

I have two files.

File 1 (shortened):

1       1:135982[b38]A,G        0       12500   A       G
1       1:611317[b38]A,G        0       611317  A       G
1       1:722408[b38]G,C        0       722408  G       C
1       1:726649[b38]G,A        0       726649  G       A

File 2 (shortened):

ENSG00000223972 ENST00000456328 DDX11L1    1       11869   14409
ENSG00000223972 ENST00000450305 DDX11L1    1       11869   14409
ENSG00000227232 ENST00000488147 WASH7P     1       14404   29570

I want to create a new file where:

if column 1 in file 1 matches column 4 of file 2
AND column 4 of file 1 is within the range of columns 5 and 6 of file 2

Print column 3 of file 2 and column 2 of file 1.

Example:

For the first line of file 1, column 1 (1) matches column 4 of file 2 (also 1) AND column 4 of file 1 (12500) is >= 11869 and <= 14409, so the desired output would be:

DDX11L1   1:134982[b38]A,G

No other matches exist in this example so that would be the sole output.

How can I do this with awk?

jhnc · Accepted Answer

Not awk but trivial with sqlite:

#!/bin/bash

# convert spaces to tabs
tr -s ' 	' '	' file1.tsv
tr -s ' 	' '	' file2.tsv

sqlite3 >output.tsv <<'EOD'
create table t1 (c1,c2,c3,c4 int,c5,c6);
create table t2 (c1,c2,c3,c4,c5 int,c6 int);
.mode tabs
.import file1.tsv t1
.import file2.tsv t2
select distinct t2.c3, t1.c2 from t1,t2
where t1.c1=t2.c4 and t1.c4 between t2.c5 and t2.c6;
EOD

Evaluate whether a column value is within a range of 2 other columns and print with awk

Answers (1)

Related Questions