Reputation: 5327
I have a table which contains intervals
import numpy as np
import pandas as pd

dfa = pd.DataFrame({'Start': [0, 101, 666], 'Stop': [100, 200, 1000]})
I have another table which contains timestamps and values
dfb = pd.DataFrame({'Timestamp': [102, 145, 113], 'ValueA': [1, 2, 21],
                    'ValueB': [1, 2, 21]})
I need to create a dataframe the same size as dfa, with added columns that contain the result of some aggregation of ValueA/ValueB over all the rows in dfb whose Timestamp lies between Start and Stop.
So here, if I define my aggregation as
{'ValueA': [np.nanmean, np.nanmin],
 'ValueB': [np.nanmax]}
my desired output would be:
   ValueA   ValueA   ValueB
  nanmean   nanmin   nanmax  Start  Stop
      nan      nan      nan      0   100
        8        1       21    101   200
      nan      nan      nan    666  1000
Upvotes: 1
Views: 74
Reputation: 862501
Use merge to build a cross join, with helper columns created by assign:
d = {'ValueA': [np.nanmean, np.nanmin],
     'ValueB': [np.nanmax]}
df = dfa.assign(A=1).merge(dfb.assign(A=1), on='A', how='outer')
Then filter by Start and Stop and aggregate by dictionary:
df = (df[(df.Timestamp >= df.Start) & (df.Timestamp <= df.Stop)]
        .groupby(['Start', 'Stop']).agg(d))
Flatten the MultiIndex columns by map with join:
df.columns = df.columns.map('_'.join)
print (df)
            ValueA_nanmean  ValueA_nanmin  ValueB_nanmax
Start Stop
101   200                8              1             21
Finally, join back to the original:
df = dfa.join(df, on=['Start','Stop'])
print (df)
   Start  Stop  ValueA_nanmean  ValueA_nanmin  ValueB_nanmax
0      0   100             NaN            NaN            NaN
1    101   200             8.0            1.0           21.0
2    666  1000             NaN            NaN            NaN
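The steps above can be put together in one self-contained sketch. It differs from the answer in two ways, both of which are my assumptions rather than the answer's own code: it uses `how='cross'` (available in pandas 1.2 and later) instead of the helper column, and it uses the string names 'mean', 'min' and 'max' instead of the NumPy functions, so the flattened column names end in `_mean`, `_min` and `_max`. The names `pairs`, `inside`, `agg` and `result` are only illustrative.

```python
import pandas as pd

dfa = pd.DataFrame({'Start': [0, 101, 666], 'Stop': [100, 200, 1000]})
dfb = pd.DataFrame({'Timestamp': [102, 145, 113],
                    'ValueA': [1, 2, 21],
                    'ValueB': [1, 2, 21]})

# Assumption: string aggregation names instead of np.nanmean etc.
# pandas' mean/min/max skip NaN by default.
d = {'ValueA': ['mean', 'min'], 'ValueB': ['max']}

# Cross join: every interval paired with every timestamp (pandas >= 1.2).
pairs = dfa.merge(dfb, how='cross')

# Keep only the pairs where Start <= Timestamp <= Stop.
inside = pairs[pairs['Timestamp'].between(pairs['Start'], pairs['Stop'])]

# Aggregate per interval and flatten the two-level column names.
agg = inside.groupby(['Start', 'Stop']).agg(d)
agg.columns = agg.columns.map('_'.join)

# Left join back so intervals without any timestamp keep NaN.
result = dfa.join(agg, on=['Start', 'Stop'])
print(result)
```

Keep in mind that the cross join builds len(dfa) * len(dfb) rows before filtering, so it is best suited to small or medium tables.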
EDIT:
Solution with cut:
d = {'ValueA': [np.nanmean, np.nanmin],
     'ValueB': [np.nanmax]}
#if not default index create it
dfa = dfa.reset_index(drop=True)
print (dfa)
   Start  Stop
0      0   100
1    101   200
2    666  1000
#add to bins first value of Start
bins = np.insert(dfa['Stop'].values, 0, dfa.loc[0, 'Start'])
print (bins)
[ 0 100 200 1000]
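#binning; the bins are right-closed and contiguous, i.e. (0, 100], (100, 200], (200, 1000],
#so a Timestamp in the gap between 200 and 666 would still get label 2, and a Timestamp of 0 gets no label
dfb['id'] = pd.cut(dfb['Timestamp'], bins=bins, labels=dfa.index)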
print (dfb)
   Timestamp  ValueA  ValueB id
0        102       1       1  1
1        145       2       2  1
2        113      21      21  1
#aggregate and flatten
df = dfb.groupby('id').agg(d)
df.columns = df.columns.map('_'.join)
#add to dfa
df = pd.concat([dfa, df], axis=1)
print (df)
   Start  Stop  ValueA_nanmean  ValueA_nanmin  ValueB_nanmax
0      0   100             NaN            NaN            NaN
1    101   200             8.0            1.0           21.0
2    666  1000             NaN            NaN            NaN
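Because pd.cut works with contiguous bins, it cannot represent the gap between 200 and 666 in dfa. A possible variant, sketched here as an assumption and not part of the answer, is to look the timestamps up in a pd.IntervalIndex built from Start and Stop. This relies on the intervals not overlapping. The extra row with Timestamp 300 is hypothetical and only there to show that a value in the gap is dropped; string aggregation names are used instead of the NumPy functions.

```python
import pandas as pd

dfa = pd.DataFrame({'Start': [0, 101, 666], 'Stop': [100, 200, 1000]})
# Hypothetical extra row (Timestamp 300) that falls in the gap 200..666.
dfb = pd.DataFrame({'Timestamp': [102, 145, 113, 300],
                    'ValueA': [1, 2, 21, 99],
                    'ValueB': [1, 2, 21, 99]})

d = {'ValueA': ['mean', 'min'], 'ValueB': ['max']}

# Positions returned below refer to a default 0..n-1 index.
dfa = dfa.reset_index(drop=True)

# Closed on both sides, matching Start <= Timestamp <= Stop.
intervals = pd.IntervalIndex.from_arrays(dfa['Start'], dfa['Stop'],
                                         closed='both')

# Position of the interval containing each timestamp, -1 if none.
# Assumes the intervals do not overlap.
pos = intervals.get_indexer(dfb['Timestamp'].to_numpy())

# Drop timestamps that matched no interval, then aggregate per interval.
matched = dfb.assign(id=pos)
matched = matched[matched['id'] >= 0]
agg = matched.groupby('id').agg(d)
agg.columns = agg.columns.map('_'.join)

# Index-aligned left join; intervals without data keep NaN.
result = dfa.join(agg)
print(result)
```

Unlike the cross join, this does not materialise every interval/timestamp pair, and unlike the contiguous bins, the row with Timestamp 300 does not end up in the last interval.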
Upvotes: 1