Reputation: 5059
I have two files. Each has 3 columns and n rows (the number of rows differs between the files).
The first looks like this:
file1
chr1 12 32
chr1 14 30
chr3 10002 89000
chrx 5678900 987654
and this:
file2
chr1 8 15
chr1 10 14
chr1 32 34
The second and third columns in each file are the start and end values of a range, and the first column is a name.
If the name in the first column of a file1 row matches the name in the first column of a file2 row, the script should check whether their ranges overlap, that is, whether the range given by columns 2 and 3 in file1 overlaps to any degree with the range given by columns 2 and 3 in file2.
An output like this is desired:
regions from file1 present in file 2
chr1 12 32 present
chr1 14 30 present
chr3 10002 89000 absent
chrx 5678900 987654 absent
Any suggestions for an awk or Python script would be appreciated.
Upvotes: 0
Views: 472
Reputation: 414207
Read file2 to create a mapping: name -> intervals, i.e., the result is: ranges = {'chr1': [[8, 15], [10, 14], [32, 34]]}. If there are many intervals for each name then, as an optimization, you could merge them: ranges = {'chr1': [[8, 15], [32, 34]]}.
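A minimal sketch of this first step, assuming whitespace-separated columns and integer coordinates with start <= end. The helper names `read_intervals` and `merge_intervals` are made up for illustration, and the merge treats touching intervals as overlapping, which is one possible choice:

```python
from collections import defaultdict

def read_intervals(lines):
    """Map each name to a list of [start, end] intervals."""
    ranges = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, start, end = fields[0], int(fields[1]), int(fields[2])
        ranges[name].append([start, end])
    return dict(ranges)

def merge_intervals(intervals):
    """Merge intervals that overlap or touch (assumes start <= end)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extend the previous interval instead of adding a new one.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged

# In practice the lines would come from open("file2"); a list stands in here.
file2_lines = ["chr1 8 15", "chr1 10 14", "chr1 32 34"]
ranges = read_intervals(file2_lines)
merged = {name: merge_intervals(ivs) for name, ivs in ranges.items()}
print(ranges)  # {'chr1': [[8, 15], [10, 14], [32, 34]]}
print(merged)  # {'chr1': [[8, 15], [32, 34]]}
```

Merging only pays off when a name has many intervals; for small files the unmerged lists work just as well.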
Define an overlap(r1, r2) function that returns whether two intervals r1 and r2 overlap. Specify whether the edges are included in the overlap.
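One way to write such a function is sketched below. The `inclusive` flag is an assumption added here to make the edge decision explicit, and the test only works when each interval has start <= end (the chrx row in file1 does not, so it would need normalizing first):

```python
def overlap(r1, r2, inclusive=True):
    """Return True if intervals r1 and r2 overlap.

    With inclusive=True, intervals that share only an endpoint count
    as overlapping; with inclusive=False they do not.
    """
    (s1, e1), (s2, e2) = r1, r2
    if inclusive:
        return s1 <= e2 and s2 <= e1
    return s1 < e2 and s2 < e1

print(overlap([12, 32], [8, 15]))                    # True
print(overlap([12, 32], [32, 34]))                   # True (shared edge)
print(overlap([12, 32], [32, 34], inclusive=False))  # False
```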
For each line in file1, find out whether an overlap is present and print the appropriate output.
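Putting the steps together, a compact end-to-end sketch could look like this. The function name `report` is hypothetical, the edges are treated as inclusive, and lists of lines stand in for the two files:

```python
def report(file1_lines, file2_lines):
    """Return one 'name start end present/absent' line per file1 row."""
    # Build name -> list of (start, end) from file2.
    ranges = {}
    for line in file2_lines:
        fields = line.split()
        if len(fields) >= 3:
            ranges.setdefault(fields[0], []).append(
                (int(fields[1]), int(fields[2])))

    results = []
    for line in file1_lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, start, end = fields[0], int(fields[1]), int(fields[2])
        # Inclusive overlap test against every file2 interval of that name.
        found = any(start <= e and s <= end
                    for s, e in ranges.get(name, []))
        results.append(f"{name} {start} {end} "
                       f"{'present' if found else 'absent'}")
    return results

file1 = ["chr1 12 32", "chr1 14 30", "chr3 10002 89000",
         "chrx 5678900 987654"]
file2 = ["chr1 8 15", "chr1 10 14", "chr1 32 34"]

print("regions from file1 present in file 2")
print("\n".join(report(file1, file2)))
```

With the sample data from the question this prints the desired output: the two chr1 rows are "present" and the chr3 and chrx rows are "absent", because file2 has no intervals under those names. To run it on real files, pass open file objects instead of the lists.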
Upvotes: 3