Reputation: 2475

Merging pandas dataframe based on relationship in multiple columns

Lets say you have a DataFrame of regions (start, end) coordinates and another DataFrame of positions which may or may not fall within a given region. For example:

region = pd.DataFrame({'chromosome': [1, 1, 1, 1, 2, 2, 2, 2], 'start': [1000, 2000, 3000, 4000, 1000, 2000, 3000, 4000], 'end': [2000, 3000, 4000, 5000, 2000, 3000, 4000, 5000]})
position = pd.DataFrame({'chromosome': [1, 2, 1, 3, 2, 1, 1], 'BP': [1500, 1100, 10000, 2200, 3300, 400, 5000]})
print region
print position


   chromosome   end  start
0           1  2000   1000
1           1  3000   2000
2           1  4000   3000
3           1  5000   4000
4           2  2000   1000
5           2  3000   2000
6           2  4000   3000
7           2  5000   4000

      BP  chromosome
0   1500           1
1   1100           2
2  10000           1
3   2200           3
4   3300           2
5    400           1
6   5000           1

A position falls within a region if:

position['BP'] >= region['start'] &
position['BP'] <= region['end'] &
position['chromosome'] == region['chromosome']

Each position is guaranteed to fall within a maximum of one region although it might not fall in any.

What is the best way to merge these two dataframe such that it appends additional columns to position with the region it falls in if it falls in any region. Giving in this case roughly the following output:

      BP  chromosome  start  end
0   1500           1  1000   2000
1   1100           2  1000   2000
2  10000           1  NA     NA
3   2200           3  NA     NA
4   3300           2  3000   4000
5    400           1  NA     NA
6   5000           1  4000   5000

One approach is to write a function to compute the relationship I want and then to use the DataFrame.apply method as follows:

def within(pos, regs):
    istrue = (pos.loc['chromosome'] == regs['chromosome']) & (pos.loc['BP'] >= regs['start']) &  (pos.loc['BP'] <= regs['end'])
    if istrue.any():
        ind = regs.index[istrue].values[0]
        return(regs.loc[ind ,['start', 'end']])
    else:
        return(pd.Series([None, None], index=['start', 'end']))

position[['start', 'end']] = position.apply(lambda x: within(x, region), axis=1)
print position

      BP  chromosome  start   end
0   1500           1   1000  2000
1   1100           2   1000  2000
2  10000           1    NaN   NaN
3   2200           3    NaN   NaN
4   3300           2   3000  4000
5    400           1    NaN   NaN
6   5000           1   4000  5000

But I'm hoping that there is a more optimized way than doing each comparison in O(N) time. Thanks!

Upvotes: 6

Answers (2)

dylkot

Reputation: 2475

The best way I found to solve this problem on my own large datasets was using the intersect method of bedtools which was wrapped in python by pybedtools (http://pythonhosted.org/pybedtools/) since the problem really boils down to interecting two sets of regions (one of which in this case is just of length 1).

Upvotes: 0

behzad.nouri

Reputation: 77961

One solution would be to do an inner-join on chromosome, exclude the violating rows, and then do left-join with position:

>>> df = pd.merge(position, region, on='chromosome', how='inner')
>>> idx = (df['BP'] < df['start']) | (df['end'] < df['BP'])  # violating rows
>>> pd.merge(position, df[~idx], on=['BP', 'chromosome'], how='left')
      BP  chromosome   end  start
0   1500           1  2000   1000
1   1100           2  2000   1000
2  10000           1   NaN    NaN
3   2200           3   NaN    NaN
4   3300           2  4000   3000
5    400           1   NaN    NaN
6   5000           1  5000   4000

Upvotes: 5

Merging pandas dataframe based on relationship in multiple columns

Answers (2)

Related Questions