Angelo

Reputation: 5059

Comparing value ranges in two files

I have two files. Each file has 3 columns and n rows (the number of rows differs between the files).

The first looks like this:

file1
chr1    12  32 
chr1    14  30
chr3    10002  89000 
chrx    5678900   987654

and the second like this:

file2
chr1    8   15
chr1    10  14
chr1    32  34

In each file, the first column is a name, and the second and third columns are the start and end values of a range.

For each row in file1, if the name in the first column matches the name in the first column of a row in file2, the script should check whether the two ranges (columns 2 and 3) overlap. Any degree of overlap counts.

An output like this is desired:

regions from file1 present in file 2

chr1    12  32   present 
chr1    14  30   present 
chr3    10002  89000  absent
chrx    5678900   987654 absent

Any suggestions for an awk or Python solution would be appreciated.

Upvotes: 0

Views: 472

Answers (1)

jfs

Reputation: 414207

  1. Read file2 to build a mapping from each name to its intervals, i.e. the result is: ranges = {'chr1': [[8, 15], [10, 14], [32, 34]]}. If there are many intervals per name, you could merge the overlapping ones as an optimization: ranges = {'chr1': [[8, 15], [32, 34]]}.

  2. Define an overlap(r1, r2) function that returns whether two intervals r1 and r2 overlap. Decide whether the edges count as part of the overlap.

  3. For each line in file1, check whether it overlaps any interval stored under the same name and print the appropriate output.
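The three steps could be sketched in Python roughly as follows. This is a minimal sketch, not a definitive implementation. It assumes the columns are separated by whitespace, that edges are included in the overlap (so 12-32 overlaps 32-34), and that start <= end in every row. The chrx row in the sample file1 has start > end and would need normalizing if chrx also appeared in file2. The helper names read_intervals, build_ranges, merge and classify are made up for illustration; only overlap and ranges come from the steps above.

```python
from collections import defaultdict


def read_intervals(lines):
    """Parse whitespace-separated 'name start end' lines into tuples."""
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        yield fields[0], int(fields[1]), int(fields[2])


def build_ranges(lines):
    """Step 1: map each name to the list of its [start, end] intervals."""
    ranges = defaultdict(list)
    for name, start, end in read_intervals(lines):
        ranges[name].append([start, end])
    return ranges


def merge(intervals):
    """Optional optimization: merge intervals that overlap or touch."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def overlap(r1, r2):
    """Step 2: True if closed intervals r1 and r2 share at least one point.

    Assumption: edges are included and start <= end in both intervals.
    """
    return r1[0] <= r2[1] and r2[0] <= r1[1]


def classify(file1_lines, file2_lines):
    """Step 3: yield (name, start, end, 'present' or 'absent') per file1 row."""
    ranges = {name: merge(ivs) for name, ivs in build_ranges(file2_lines).items()}
    for name, start, end in read_intervals(file1_lines):
        found = any(overlap([start, end], r) for r in ranges.get(name, []))
        yield name, start, end, "present" if found else "absent"
```

To run it on the actual files, something like this should work, since open file objects iterate over lines:

```python
with open("file1") as f1, open("file2") as f2:
    for row in classify(f1, f2):
        print(*row, sep="\t")
```

A linear scan over the intervals of each name is enough for small inputs. With many intervals per name, the merged (sorted, non-overlapping) lists could be searched with the bisect module instead.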

Upvotes: 3
