Reputation: 2475
Lets say you have a DataFrame of regions (start, end) coordinates and another DataFrame of positions which may or may not fall within a given region. For example:
region = pd.DataFrame({'chromosome': [1, 1, 1, 1, 2, 2, 2, 2], 'start': [1000, 2000, 3000, 4000, 1000, 2000, 3000, 4000], 'end': [2000, 3000, 4000, 5000, 2000, 3000, 4000, 5000]})
position = pd.DataFrame({'chromosome': [1, 2, 1, 3, 2, 1, 1], 'BP': [1500, 1100, 10000, 2200, 3300, 400, 5000]})
print region
print position
chromosome end start
0 1 2000 1000
1 1 3000 2000
2 1 4000 3000
3 1 5000 4000
4 2 2000 1000
5 2 3000 2000
6 2 4000 3000
7 2 5000 4000
BP chromosome
0 1500 1
1 1100 2
2 10000 1
3 2200 3
4 3300 2
5 400 1
6 5000 1
A position falls within a region if:
position['BP'] >= region['start'] &
position['BP'] <= region['end'] &
position['chromosome'] == region['chromosome']
Each position is guaranteed to fall within a maximum of one region although it might not fall in any.
What is the best way to merge these two dataframe such that it appends additional columns to position with the region it falls in if it falls in any region. Giving in this case roughly the following output:
BP chromosome start end
0 1500 1 1000 2000
1 1100 2 1000 2000
2 10000 1 NA NA
3 2200 3 NA NA
4 3300 2 3000 4000
5 400 1 NA NA
6 5000 1 4000 5000
One approach is to write a function to compute the relationship I want and then to use the DataFrame.apply method as follows:
def within(pos, regs):
istrue = (pos.loc['chromosome'] == regs['chromosome']) & (pos.loc['BP'] >= regs['start']) & (pos.loc['BP'] <= regs['end'])
if istrue.any():
ind = regs.index[istrue].values[0]
return(regs.loc[ind ,['start', 'end']])
else:
return(pd.Series([None, None], index=['start', 'end']))
position[['start', 'end']] = position.apply(lambda x: within(x, region), axis=1)
print position
BP chromosome start end
0 1500 1 1000 2000
1 1100 2 1000 2000
2 10000 1 NaN NaN
3 2200 3 NaN NaN
4 3300 2 3000 4000
5 400 1 NaN NaN
6 5000 1 4000 5000
But I'm hoping that there is a more optimized way than doing each comparison in O(N) time. Thanks!
Upvotes: 6
Views: 4900
Reputation: 2475
The best way I found to solve this problem on my own large datasets was using the intersect method of bedtools which was wrapped in python by pybedtools (http://pythonhosted.org/pybedtools/) since the problem really boils down to interecting two sets of regions (one of which in this case is just of length 1).
Upvotes: 0
Reputation: 77961
One solution would be to do an inner-join on chromosome
, exclude the violating rows, and then do left-join with position
:
>>> df = pd.merge(position, region, on='chromosome', how='inner')
>>> idx = (df['BP'] < df['start']) | (df['end'] < df['BP']) # violating rows
>>> pd.merge(position, df[~idx], on=['BP', 'chromosome'], how='left')
BP chromosome end start
0 1500 1 2000 1000
1 1100 2 2000 1000
2 10000 1 NaN NaN
3 2200 3 NaN NaN
4 3300 2 4000 3000
5 400 1 NaN NaN
6 5000 1 5000 4000
Upvotes: 5