Reputation: 23
I am trying to find pairs of intervals that overlap by at least some minimum overlap length that is set by the user. The intervals are from this pandas dataframe:
import pandas as pds
print(df1.head())
print(df1.tail())
query_id start_pos end_pos read_length orientation
0 1687655 1 4158 4158 F
1 2485364 1 7233 7233 R
2 1412202 1 3215 3215 R
3 1765889 1 3010 3010 R
4 2944965 1 4199 4199 R
query_id start_pos end_pos read_length orientation
3082467 112838 27863832 27865583 1752 F
3082468 138670 28431208 28431804 597 R
3082469 171683 28489928 28490409 482 F
3082470 2930053 28569533 28569860 328 F
3082471 1896622 28589281 28589554 274 R
where start_pos is where the interval starts and end_pos is where the interval ends. read_length is the length of the interval.
The data is sorted by start_pos.
The program should have the following output format:
query_id1 -- query_id2 -- read_length1 -- read_length2 -- overlap_length
I am running the program on a compute node with up to 512gb RAM and 4x Intel Xeon E7-4830 CPU (32 cores).
I've tried running my own code to find the overlaps, but it is taking too long to run.
Here is the code that I tried,
import pandas as pds
overlap_df = pds.DataFrame()
def create_overlap_table(df1, ovl_len):
...
(sort and clean the data here)
...
def iterate_queries(row):
global overlap_df
index1 = df1.index[df1['query_id'] == row['query_id']]
next_int_index = df1.index.get_loc(index1[0]) + 1
if row['read_length'] >= ovl_len:
if df1.index.size-1 >= next_int_index:
end_pos_minus_ovlp = (row['end_pos'] - ovl_len) + 2
subset_df = df1.loc[(df1['start_pos'] < end_pos_minus_ovlp)]
subset_df = subset_df.loc[subset_df.index == subset_df.index.max()]
subset_df = df1.iloc[next_int_index:df1.index.get_loc(subset_df.index[0])]
subset_df = subset_df.loc[subset_df['read_length'] >= ovl_len]
rows1 = pds.DataFrame({'read_id1': np.repeat(row['query_id'], repeats=subset_df.index.size), 'read_id2': subset_df['query_id']})
overlap_df = overlap_df.append(rows1)
df1.apply(iterate_queries, axis=1)
print(overlap_df)
Again, I ran this code on the compute node, but it was running for hours before I finally cancelled the job.
I've also tried using two packages for this problem--PyRanges, as well as an R package called IRanges, but they take too long to run as well. I've seen posts on interval trees and a python library called pybedtools, and I was planning on looking into them as a next step.
Any feedback would be really appreciated
EDIT: For minimum overlap length of, say 800, the first 5 rows should look like this,
query_id1 query_id2 read_length1 read_length2 overlap_length
1687655 2485364 4158 7233 4158
1687655 1412202 4158 3215 3215
1687655 1765889 4158 3010 3010
1687655 2944965 4158 4199 4158
2485364 1412202 7233 3215 3215
So, query_id1 and query_id2 cannot be identical. Also, no duplications (i.e., an overlap between A and B should not appear twice in the output).
Upvotes: 2
Views: 1225
Reputation: 32018
pyranges author here. Thanks for trying my library.
How big are your data? When both PyRanges were 1e7 rows, the heavy part done by pyranges took about 12 seconds using 24 cores on a busy server with 200 gb ram.
Setup:
import pyranges as pr
import numpy as np
import mkl
mkl.set_num_threads(1)
### Create data ###
length = int(1e7)
minimum_overlap = 5
gr = pr.random(length)
gr2 = pr.random(length)
# add ids
gr.id1 = np.arange(len(gr))
gr2.id2 = np.arange(len(gr))
# add lengths
gr.length1 = gr.lengths()
gr2.length2 = gr2.lengths()
gr
# +--------------+-----------+-----------+--------------+-----------+-----------+
# | Chromosome | Start | End | Strand | id1 | length1 |
# | (category) | (int32) | (int32) | (category) | (int64) | (int32) |
# |--------------+-----------+-----------+--------------+-----------+-----------|
# | chr1 | 146230338 | 146230438 | + | 0 | 100 |
# | chr1 | 199561432 | 199561532 | + | 1 | 100 |
# | chr1 | 189095813 | 189095913 | + | 2 | 100 |
# | chr1 | 27608425 | 27608525 | + | 3 | 100 |
# | ... | ... | ... | ... | ... | ... |
# | chrY | 21533766 | 21533866 | - | 9999996 | 100 |
# | chrY | 30105890 | 30105990 | - | 9999997 | 100 |
# | chrY | 49764407 | 49764507 | - | 9999998 | 100 |
# | chrY | 3319478 | 3319578 | - | 9999999 | 100 |
# +--------------+-----------+-----------+--------------+-----------+-----------+
# Stranded PyRanges object has 10,000,000 rows and 6 columns from 25 chromosomes.
Doing your analysis:
j = gr.join(gr2, new_pos="intersection", nb_cpu=24)
# CPU times: user 3.85 s, sys: 3.56 s, total: 7.41 s
# Wall time: 12.3 s
j.overlap = j.lengths()
out = j.df["id1 id2 length1 length2 overlap".split()]
out = out[out.overlap >= minimum_overlap]
out.head()
id1 id2 length1 length2 overlap
1 2 485629 100 100 74
2 2 418820 100 100 92
3 3 487066 100 100 13
4 7 191109 100 100 31
5 11 403447 100 100 76
Upvotes: 1
Reputation: 120229
Here's an algorithm.
Upvotes: 2