Reputation: 141
I have df like:
material plant Order
24990 89952 4568789,5098710
24990 89952 9448609,1007081
166621 3062 18364103
166621 3062 78309139
240758 3062 55146035
276009 3062 38501581,857542
and df1 like:
material plant Order m1 m2 m3 m4 m5
24990 89952 4568789 0.123 0.214 0.0 0.0 0.0
24990 89952 5098710 1.000 0.363 0.0 0.0 0.0
24990 89952 9448609 0.0 0.345 0.0 1.0 0.0
24990 89952 1007081 0.0 0.756 0.0 1.0 0.0
166621 3062 18364103 0.0 0.0 0.0 0.0 0.0
166621 3062 78309139 0.0 1.0 0.0 0.0 0.0
240758 3062 55146035 1.0 1.0 1.0 0.0 0.0
276009 3062 38501581 1.0 1.0 1.0 0.0 0.0
276009 3062 38575428 1.0 1.0 1.0 0.0 0.0
I want to iterate through Order in df1 and when there is a match of Order in df2 then find average from m1 to m5. I want to achieve df2 like:
material plant Order avg m1 avgm2 avgm3 avgm4 avgm5
24990 89952 4568789,5098710 0.5615 0.2885 0.0 0.0 0.0
24990 89952 9448609,1007081
166621 3062 18364103
166621 3062 78309139
240758 3062 55146035
276009 3062 38501581,857542
I am trying different ways to achieve df2 such as:
df2 = (df.groupby(df1, sort=False)['Order'].apply(lambda x: ','.split(x.astype(str)))
.mean()
.reset_index()
.reindex(columns=df.columns))
print (df2)
second:
group = df.columns[np.r_[0:3, 3:len(df.columns)]]
res = df1.groupby(group)['Order'].apply(list).mean().reset_index()
But I am not sure if this is the correct way to get it.
Upvotes: 4
Views: 119
Reputation: 164673
Here is one way via numpy
and a mapping dictionary.
# map Order to values with a dictionary
mapper = dict(zip(df1['Order'], df1[['m'+str(i) for i in range(1, 6)]].values))
# map comma-separated numbers to list of integers
df2_orders = [list(map(int, i)) for i in df2['Order'].str.split(',')]
# calculate mean
res = [np.mean([mapper.get(o, [0]*5) for o in order], axis=0).tolist() \
for order in df2_orders]
# join results to dataframe
df2 = df2.join(pd.DataFrame(res, columns=['avg_m'+str(i) for i in range(1, 6)]))
Note if data is missing (for example, order 857542
), you can specify what values to include, here I use 0s.
Result
material plant Order avg_m1 avg_m2 avg_m3 avg_m4 avg_m5
0 24990 89952 4568789,5098710 0.5615 0.2885 0.0 0.0 0.0
1 24990 89952 9448609,1007081 0.0000 0.5505 0.0 1.0 0.0
2 166621 3062 18364103 0.0000 0.0000 0.0 0.0 0.0
3 166621 3062 78309139 0.0000 1.0000 0.0 0.0 0.0
4 240758 3062 55146035 1.0000 1.0000 1.0 0.0 0.0
5 276009 3062 38501581,857542 0.5000 0.5000 0.5 0.0 0.0
Upvotes: 2
Reputation: 862681
You can use:
df = (df1.join(df1.set_index(['material','plant'], append=True)['Order']
.str.split(',', expand=True)
.stack()
.astype(int)
.reset_index(name='Order')
.merge(df2, on=['material','plant','Order'], how='left')
.drop(['material','plant','Order','level_3'], axis=1)
.groupby('level_0')
.mean())
)
print (df)
material plant Order m1 m2 m3 m4 m5
0 24990 89952 4568789,5098710 0.5615 0.2885 0.0 0.0 0.0
1 24990 89952 9448609,1007081 0.0000 0.5505 0.0 1.0 0.0
2 166621 3062 18364103 0.0000 0.0000 0.0 0.0 0.0
3 166621 3062 78309139 0.0000 1.0000 0.0 0.0 0.0
4 240758 3062 55146035 1.0000 1.0000 1.0 0.0 0.0
5 276009 3062 38501581,857542 1.0000 1.0000 1.0 0.0 0.0
Explanation:
DataFrame
by split
and stack
DataFrame
with merge
and left joindrop
mean
join
ed to first DataFrame
Detail:
df3 = (df1.set_index(['material','plant'], append=True)['Order']
.str.split(',', expand=True)
.stack()
.astype(int)
.reset_index(name='Order')
.merge(df2, on=['material','plant','Order'], how='left'))
print (df3)
level_0 material plant level_3 Order m1 m2 m3 m4 m5
0 0 24990 89952 0 4568789 0.123 0.214 0.0 0.0 0.0
1 0 24990 89952 1 5098710 1.000 0.363 0.0 0.0 0.0
2 1 24990 89952 0 9448609 0.000 0.345 0.0 1.0 0.0
3 1 24990 89952 1 1007081 0.000 0.756 0.0 1.0 0.0
4 2 166621 3062 0 18364103 0.000 0.000 0.0 0.0 0.0
5 3 166621 3062 0 78309139 0.000 1.000 0.0 0.0 0.0
6 4 240758 3062 0 55146035 1.000 1.000 1.0 0.0 0.0
7 5 276009 3062 0 38501581 1.000 1.000 1.0 0.0 0.0
8 5 276009 3062 1 857542 NaN NaN NaN NaN NaN
Upvotes: 1