Sayantan Ghosh
Sayantan Ghosh

Reputation: 338

Compare a column in one dataframe with two other columns in a different dataframe?

I have created two data frames from two tsv files. The data frames are as follows:

Dataframe1 (df1)
chr    position
5      745
7      963
8      1024

Dataframe2 (df2)
chr    start    end
1      10       100
1      500      600
5      250      600
5      784      1045
7      98       980
7      11       85
8      450      1000
8      1546     1886
12     63       1400

Now, I want to create a new column of df1 which will give 'True' if for the same chr the position falls within the start and end (of df2). I am using the following code:

df1['Valid'] = np.where((df1['chr'] == df2['chr']) & (df1['position'] >= df2['start']) & (df1['position'] <= df2['end']),'True','False')

This is not working and giving the error message - ValueError: Can only compare identically-labeled Series objects. How to do this?

Expected output is:

Dataframe1 (df1)
chr    position    Valid
5      745         False
7      963         True
8      1024        False

Upvotes: 0

Views: 84

Answers (3)

Andy L.
Andy L.

Reputation: 25239

I use map and boolean mask with any and groupby

s = df2.chr.map(df1.set_index('chr').position)
df1['Valid'] =  df1.chr.map(((s>=df2.start) & (s<=df2.end)).groupby(df2.chr).any())

Out[340]:
   chr  position  Valid
0    5       745  False
1    7       963   True
2    8      1024  False

Upvotes: 0

Alexander
Alexander

Reputation: 109526

Merge the dataframes, evaluate, then drop the unused columns.

>>> (df1
 .merge(df2, on='chr', how='left')
 .assign(Valid=lambda df: df.eval('start <= position <= end'))
 .drop(columns=['start', 'end'])
)
   chr  position  Valid
0    5       745  False
1    7       963   True
2    8      1024  False

In the case of multiple chr values in df2, merge the position onto df2, evaluate each, and then group on chr and determine if any position is valid. Assign the result back to df1:

valid = (
    df2
    .merge(df1, on='chr', how='right')
    .assign(Valid=lambda df: df.eval('start <= position <= end'))
    .groupby('chr')['Valid'].any()
)
>>> df1.merge(valid, left_on='chr', right_index=True)
   chr  position  Valid
0    5       745  False
1    7       963   True
2    8      1024  False

Upvotes: 2

eugen
eugen

Reputation: 1329

Have a look:



import pandas as pd
import numpy as np
df1 = pd.DataFrame({'chr': [5,7,8],
                   'position': [745, 963,1024]})

df2 = pd.DataFrame({'chr': [1,5,7,8,12],
                   'start': [10,250,98,450,63],
                    'end':[100,600,980,1000,1400]})

df2 = df2.loc[df2['chr'].isin(df1.chr)].reset_index(drop=True)

df1['Valid'] = np.where((df1['chr'] == df2['chr']) & (df1['position'] >= df2['start']) & (df1['position'] <= df2['end']),'True','False')
print(df1)

and the output:

   chr  position  Valid
0    5       745  False
1    7       963   True
2    8      1024  False

Upvotes: 0

Related Questions