Reputation: 183
I have 2 data frames
print(df1)
Name df1 RT [min] Molecular Weight RT [min]+0.2 RT [min]-0.2 Molecular Weight + 0.2 Molecular Weight - 0.2
0 unknow compound 1 7.590 194.04212 7.790 7.390 194.24212 193.84212
1 unknow compound 2 7.510 194.15000 7.710 7.310 194.35000 193.95000
2 unknow compound 3 7.410 194.04209 7.610 7.210 194.24209 193.84209
3 unknow compound 4 7.434 342.11615 7.634 7.234 342.31615 341.91615
4 unknow compound 5 0.756 176.03128 0.956 0.556 176.23128 175.83128
and
print(df2)
Name df2 Molecular Weight RT [min]
0 β-D-Glucopyranuronic acid 194.04220 7.483
1 α,α-Trehalose 194.10000 7.350
2 Threonylserine 206.08970 8.258
3 Terephthalic acid 166.02595 7.465
4 Sulfuric acid 97.96714 8.909
I want to merge the rows from df2 to those in df1 if 2 conditions are met.
This means that if a row from df2 meets the conditions of two other rows from df1, the row of df1 will be duplicated.
so df3 should look
print(df3)
Name df1 RT [min]+0.2 RT [min]-0.2 Molecular Weight + 0.2 Molecular Weight - 0.2 Name df2 Molecular Weight RT [min]
0 unknow compound 1 7.790 7.390 194.24212 193.84212 β-D-Glucopyranuronic acid 194.0422 7.483
1 unknow compound 1 7.790 7.390 194.24212 193.84212 α,α-Trehalose 194.1000 7.350
2 unknow compound 2 7.710 7.310 194.35000 193.95000 β-D-Glucopyranuronic acid 194.0422 7.483
3 unknow compound 3 8.310 7.910 206.30000 205.90000 Threonylserine 206.0897 8.258
4 unknow compound 4 7.634 7.234 342.31615 341.91615 NaN NaN NaN
5 unknow compound 5 0.956 0.556 176.23128 175.83128 NaN NaN NaN
the first row in df2 meets the 2 conditions of the unknow compound 1 and unknow compound 2 from df1 so we have it twice in df3.
the second row in df2 meets the 2 conditions of the unknow compound 1 only.
the third row in df2 meets the 2 conditions of the unknow compound 3 only.
all the other rows do not meet any of the conditions from df1.
I tried to do it based on How to join two data frames for which column values are within a certain range? the first answer
import pandas as pd
df_1 = pd.read_excel (r'D:\CD SandBox\df1.xlsx')
df_2 = pd.read_excel (r'D:\CD SandBox\df2.xlsx')
df2.index = pd.IntervalIndex.from_arrays(df2['RT [min]-0.2'],df2['RT [min]+0.2'],closed='both')
df2['RT [min]'] = df2['RT [min]'].apply( lambda x : df2.iloc[df1.index.get_loc(x)])
But do not know what to do with the second line of code and receive this error:
df2['RT [min]'] = df2['RT [min]'].apply( lambda x : df2.iloc[df1.index.get_loc(x)])
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "C:\Users\BCDD\Anaconda3\envs\PTSD\lib\site-packages\pandas\core\series.py", line 4213, in apply
mapped = lib.map_infer(values, f, convert=convert_dtype)
File "pandas\_libs\lib.pyx", line 2403, in pandas._libs.lib.map_infer
File "<input>", line 1, in <lambda>
File "C:\Users\BCDD\Anaconda3\envs\PTSD\lib\site-packages\pandas\core\indexes\interval.py", line 730, in get_loc
raise KeyError(key)
KeyError: 8.258
EDIT: Trying using merge_asof
according to How to join two DataFrames with multiple overlapping timestamps using an extra shared variable
df2 = df2.drop(['RT [min]', 'Molecular Weight'], axis=1)
df2['RT [min]']=df2['RT [min]-0.2']
pd.merge_asof(df2[['RT [min]','Name df2']] , df1,on='RT [min]',direction ='forward',allow_exact_matches =True)
...
RT [min] Name df2 Name df1 Molecular Weight
0 0.556 unknow compound 5 α,α-Trehalose 194.10000
1 7.210 unknow compound 3 α,α-Trehalose 194.10000
2 7.234 unknow compound 4 α,α-Trehalose 194.10000
3 7.310 unknow compound 2 α,α-Trehalose 194.10000
4 7.390 unknow compound 1 Terephthalic acid 166.02595
gives wrong matching for the table.
Any idea\ hint will be appreciated
Upvotes: 1
Views: 123
Reputation: 4761
If you're using pandas 1.2.0 you can create the cartesian product of both dataframes and then check the conditions. Also, as you don't need RT [min]
and Molecular Weight
from df1
, I'll assume you already removed them:
df3 = df1.merge(df2, how = 'cross', suffixes = [None,None])
#check if 'Molecular Weight' is in the interval:
mask1 = df3['Molecular Weight'].ge(df3['Molecular Weight - 0.2']) & df3['Molecular Weight'].le(df3['Molecular Weight + 0.2'])
#check if 'RT [min]' is in the interval
mask2 = df3['RT [min]'].ge(df3['RT [min]-0.2']) & df3['RT [min]'].le(df3['RT [min]+0.2'])
df3 = df3[mask1 & mask2].reset_index(drop = True)
Output:
df3
Name df1 RT [min]+0.2 RT [min]-0.2 ... Name df2 Molecular Weight RT [min]
0 unknow compound 1 7.79 7.39 ... β-D-Glucopyranuronic acid 194.0422 7.483
1 unknow compound 2 7.71 7.31 ... β-D-Glucopyranuronic acid 194.0422 7.483
2 unknow compound 2 7.71 7.31 ... α,α-Trehalose 194.1000 7.350
3 unknow compound 3 7.61 7.21 ... β-D-Glucopyranuronic acid 194.0422 7.483
4 unknow compound 3 7.61 7.21 ... α,α-Trehalose 194.1000 7.350
As your data is considerably large, may you like to use a generator in order to don't load the whole resulting dataframe. Again, I'm assuming you removed RT [min]
and Molecular Weight
from df1
.
import numpy as np
from itertools import product
def df_iter(df1,df2):
for row1, row2 in product(df1.values, df2.values):
# RT [min]-0.2 <= RT [min] <= RT [min]+0.2
if row1[2] <= row2[2] <= row1[1]:
#Molecular Weight - 0.2 <= Molecular Weight <= Molecular Weight + 0.2
if row1[4] <= row2[1] <= row1[3]:
yield np.concatenate((row1,row2))
df3_rows = df_iter(df1,df2)
Then you can manipulate the rows:
for row in df3_rows:
print(row)
Output:
['unknow compound 1' 7.79 7.39 194.24212 193.84212 'β-D-Glucopyranuronic acid' 194.0422 7.483]
['unknow compound 2' 7.71 7.31 194.35 193.95 'β-D-Glucopyranuronic acid' 194.0422 7.483]
['unknow compound 2' 7.71 7.31 194.35 193.95 'α,α-Trehalose' 194.1 7.35]
['unknow compound 3' 7.61 7.21 194.24209 193.84209 'β-D-Glucopyranuronic acid' 194.0422 7.483]
['unknow compound 3' 7.61 7.21 194.24209 193.84209 'α,α-Trehalose' 194.1 7.35]
Or create a dataframe:
df3 = pd.DataFrame(data = list(df3_rows),
columns = np.concatenate((df1.columns, df2.columns)))
Which results in the same dataframe from Option 1.
NOTE1: Be careful with the indices in the conditionals from function df_iter
, those work in my case.
NOTE2: I'm pretty sure your data doesn't match with the example df3
.
Upvotes: 1