Reputation: 338
I have created two data frames from two tsv files. The data frames are as follows:
Dataframe1 (df1)
chr position
5 745
7 963
8 1024
Dataframe2 (df2)
chr start end
1 10 100
1 500 600
5 250 600
5 784 1045
7 98 980
7 11 85
8 450 1000
8 1546 1886
12 63 1400
Now, I want to create a new column of df1 which will give 'True' if for the same chr
the position falls within the start
and end
(of df2). I am using the following code:
df1['Valid'] = np.where((df1['chr'] == df2['chr']) & (df1['position'] >= df2['start']) & (df1['position'] <= df2['end']),'True','False')
This is not working and giving the error message - ValueError: Can only compare identically-labeled Series objects. How to do this?
Expected output is:
Dataframe1 (df1)
chr position Valid
5 745 False
7 963 True
8 1024 False
Upvotes: 0
Views: 84
Reputation: 25239
I use map
and boolean mask with any
and groupby
s = df2.chr.map(df1.set_index('chr').position)
df1['Valid'] = df1.chr.map(((s>=df2.start) & (s<=df2.end)).groupby(df2.chr).any())
Out[340]:
chr position Valid
0 5 745 False
1 7 963 True
2 8 1024 False
Upvotes: 0
Reputation: 109526
Merge the dataframes, evaluate, then drop the unused columns.
>>> (df1
.merge(df2, on='chr', how='left')
.assign(Valid=lambda df: df.eval('start <= position <= end'))
.drop(columns=['start', 'end'])
)
chr position Valid
0 5 745 False
1 7 963 True
2 8 1024 False
In the case of multiple chr
values in df2
, merge the position
onto df2
, evaluate each, and then group on chr
and determine if any position is valid. Assign the result back to df1
:
valid = (
df2
.merge(df1, on='chr', how='right')
.assign(Valid=lambda df: df.eval('start <= position <= end'))
.groupby('chr')['Valid'].any()
)
>>> df1.merge(valid, left_on='chr', right_index=True)
chr position Valid
0 5 745 False
1 7 963 True
2 8 1024 False
Upvotes: 2
Reputation: 1329
Have a look:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'chr': [5,7,8],
'position': [745, 963,1024]})
df2 = pd.DataFrame({'chr': [1,5,7,8,12],
'start': [10,250,98,450,63],
'end':[100,600,980,1000,1400]})
df2 = df2.loc[df2['chr'].isin(df1.chr)].reset_index(drop=True)
df1['Valid'] = np.where((df1['chr'] == df2['chr']) & (df1['position'] >= df2['start']) & (df1['position'] <= df2['end']),'True','False')
print(df1)
and the output:
chr position Valid
0 5 745 False
1 7 963 True
2 8 1024 False
Upvotes: 0