Reputation: 702
This is a small example of a much bigger dataset. Imagine I have two dataframes like these:
import pandas as pd
import numpy as np
np.random.seed(42)
df1 = pd.DataFrame({'Depth': np.arange(0.5, 4.5, 0.5),
                    'Feat1': np.random.randint(20, 70, 8)})
df2 = pd.DataFrame({'Depth': [0.4, 1.1, 1.5, 2.2, 2.8],
                    'Rock': ['Sand', 'Sand', 'Clay', 'Clay', 'Marl']})
They have different sizes, and I would like to add the 'Rock' information from df2 to df1 as a new column. The combination should be based on the 'Depth' columns of the two dataframes, but they have different sampling rates: df1 follows a constant step of 0.5, while the layer thicknesses in df2 vary.
So I would like to merge this information based on approximate 'Depth' values. For example: if a sample of df2 has a 'Depth' of 2.2, look for the nearest 'Depth' value in df1, which should be 2.0, and add its 'Rock' information ('Clay') to that sample. It is also important to say that 'Rock' values can be repeated in the new column to avoid missing data within a segment. Could anyone help me?
I already tried some pandas methods such as 'merge' and 'combine_first', but I couldn't get the result I wanted. It should be something like this:
Upvotes: 3
Views: 906
Reputation: 35686
Use merge_asof:
df3 = pd.merge_asof(df1, df2, on='Depth', tolerance=0.5, direction='nearest')
df3
Output:
Depth Feat1 Rock
0 0.5 58 Sand
1 1.0 48 Sand
2 1.5 34 Clay
3 2.0 62 Clay
4 2.5 27 Clay
5 3.0 40 Marl
6 3.5 58 NaN
7 4.0 38 NaN
Complete Working Example:
import numpy as np
import pandas as pd
np.random.seed(42)
df1 = pd.DataFrame({
    'Depth': np.arange(0.5, 4.5, 0.5),
    'Feat1': np.random.randint(20, 70, 8)
})
df2 = pd.DataFrame({
    'Depth': [0.4, 1.1, 1.5, 2.2, 2.8],
    'Rock': ['Sand', 'Sand', 'Clay', 'Clay', 'Marl']
})
df3 = pd.merge_asof(df1, df2, on='Depth', tolerance=0.5, direction='nearest')
print(df3)
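Note that merge_asof requires both frames to be sorted on the 'Depth' key, which is already the case here. The rows at Depth 3.5 and 4.0 fall outside the 0.5 tolerance and stay NaN; if you also want to repeat the last known 'Rock' value downward, as the question suggests, a forward fill is one option. This is a sketch of that assumed extension, not part of the output above:
df3['Rock'] = df3['Rock'].ffill()  # repeat the last matched rock type for deeper samples
print(df3)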
Upvotes: 7