Reputation: 10383
I'm working on a dataset from an old Kaggle competition, and I want to make some aggregations on one of the tables.
To know on which days the discount is bigger, my expected output looks like this:
For that I tried this code:
(coupon_list[['USABLE_DATE_MON', 'USABLE_DATE_TUE', 'USABLE_DATE_WED',
              'USABLE_DATE_THU', 'USABLE_DATE_FRI', 'USABLE_DATE_SAT',
              'USABLE_DATE_SUN', 'DISCOUNT_PRICE']]
    .melt("DISCOUNT_PRICE")
    .groupby("variable")
    .agg({"DISCOUNT_PRICE": sum, "value": sum}))
But with that aggregation the DISCOUNT_PRICE total for each day includes every row of the table, not only the rows where the coupon is actually usable on that day.
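For example (an illustrative two-row frame of my own, using the same column names), melt repeats DISCOUNT_PRICE under every day column, so every group ends up summing the whole table:
import pandas as pd
import numpy as np

# Illustrative only: two coupons, two day columns
tiny = pd.DataFrame({'USABLE_DATE_MON': [1, np.nan],
                     'USABLE_DATE_TUE': [np.nan, 1],
                     'DISCOUNT_PRICE': [10, 20]})

m = tiny.melt('DISCOUNT_PRICE')
print(m.groupby('variable')['DISCOUNT_PRICE'].sum())
# Both days show 30, because the rows with NaN value are still included in each group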
To solve this I created a new table:
coupon_list_usable["DISCOUNT_PRICE"] = coupon_list_usable.apply(
    lambda x: x.DISCOUNT_PRICE if x.value == 1 else 0, axis=1)

(coupon_list_usable.groupby("variable")
     .agg({"DISCOUNT_PRICE": sum, "value": sum})
     .reset_index()[["variable", "DISCOUNT_PRICE"]]
     .set_index("variable")
     .plot.bar())
But this isn't a very pythonic solution. Is it possible to do this within the groupby itself?
Upvotes: 2
Views: 149
Reputation: 862406
Use query to filter the rows with 1 and GroupBy.sum for the aggregation:
Notice: dropna cannot be used here, because the day columns also contain 2 values.
s = (coupon_list.melt("DISCOUNT_PRICE")
        .query('value == 1')
        .groupby("variable")['DISCOUNT_PRICE']
        .sum())

s.plot.bar()
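A quick check of why dropna is not a substitute here (an illustrative snippet of mine, run against the sample frame defined below): the rows holding 2 survive dropna and inflate those days' totals:
melted = coupon_list.melt("DISCOUNT_PRICE")

# Besides 1.0 and NaN, the day columns also contain 2.0 values
print(melted['value'].value_counts(dropna=False))

# dropna keeps the rows with value 2, so FRI and SAT come out larger
# than with query('value == 1')
print(melted.dropna(subset=['value'])
            .groupby('variable')['DISCOUNT_PRICE']
            .sum())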
You can also change the ordering of the days with reindex or with an ordered Categorical:
days = ['USABLE_DATE_MON', 'USABLE_DATE_TUE', 'USABLE_DATE_WED',
        'USABLE_DATE_THU', 'USABLE_DATE_FRI', 'USABLE_DATE_SAT',
        'USABLE_DATE_SUN']

s = (coupon_list.melt("DISCOUNT_PRICE")
        .query('value == 1')
        .groupby("variable")['DISCOUNT_PRICE']
        .sum()
        .reindex(days))
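One optional detail (my assumption, not part of the original answer): if some day never had a value of 1, reindex would leave NaN for it; passing fill_value gives 0 instead:
# Hypothetical tweak: days without any usable row become 0 rather than NaN
s = (coupon_list.melt("DISCOUNT_PRICE")
        .query('value == 1')
        .groupby("variable")['DISCOUNT_PRICE']
        .sum()
        .reindex(days, fill_value=0))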
days = ['USABLE_DATE_MON', 'USABLE_DATE_TUE', 'USABLE_DATE_WED',
        'USABLE_DATE_THU', 'USABLE_DATE_FRI', 'USABLE_DATE_SAT',
        'USABLE_DATE_SUN']

s = (coupon_list.melt("DISCOUNT_PRICE", var_name='days', value_name='data')
        .assign(days = lambda x: pd.Categorical(x['days'],
                                                ordered=True,
                                                categories=days))
        .query('data == 1')   # the value column was renamed to 'data' by value_name
        .groupby("days")['DISCOUNT_PRICE']
        .sum())
Sample:
import pandas as pd
import numpy as np

coupon_list = pd.DataFrame({
    'USABLE_DATE_MON': [np.nan, np.nan, np.nan, 1, 1, np.nan],
    'USABLE_DATE_TUE': [1, np.nan, 1, np.nan, 1, np.nan],
    'USABLE_DATE_WED': [1, np.nan, np.nan, np.nan, 1, 1],
    'USABLE_DATE_THU': [1, 1, np.nan, 1, 1, np.nan],
    'USABLE_DATE_FRI': [np.nan, 1, 2, np.nan, 1, np.nan],
    'USABLE_DATE_SAT': [1, 1, np.nan, 1, 1, 2],
    'USABLE_DATE_SUN': [1, np.nan, 1, 1, 1, 1],
    'DISCOUNT_PRICE': [2, 3, 6, 2, 2, 4],
})
print (coupon_list)
   USABLE_DATE_MON  USABLE_DATE_TUE  USABLE_DATE_WED  USABLE_DATE_THU  \
0              NaN              1.0              1.0              1.0
1              NaN              NaN              NaN              1.0
2              NaN              1.0              NaN              NaN
3              1.0              NaN              NaN              1.0
4              1.0              1.0              1.0              1.0
5              NaN              NaN              1.0              NaN

   USABLE_DATE_FRI  USABLE_DATE_SAT  USABLE_DATE_SUN  DISCOUNT_PRICE
0              NaN              1.0              1.0               2
1              1.0              1.0              NaN               3
2              2.0              NaN              1.0               6
3              NaN              1.0              1.0               2
4              1.0              1.0              1.0               2
5              NaN              2.0              1.0               4
days = ['USABLE_DATE_MON', 'USABLE_DATE_TUE', 'USABLE_DATE_WED',
        'USABLE_DATE_THU', 'USABLE_DATE_FRI', 'USABLE_DATE_SAT',
        'USABLE_DATE_SUN']

s = (coupon_list.melt("DISCOUNT_PRICE", var_name='days', value_name='data')
        .assign(days = lambda x: pd.Categorical(x['days'],
                                                ordered=True,
                                                categories=days))
        .query('data == 1')
        .groupby("days")['DISCOUNT_PRICE']
        .sum())
print (s)
days
USABLE_DATE_MON     4
USABLE_DATE_TUE    10
USABLE_DATE_WED     8
USABLE_DATE_THU     9
USABLE_DATE_FRI     5
USABLE_DATE_SAT     9
USABLE_DATE_SUN    16
Name: DISCOUNT_PRICE, dtype: int64
s.plot.bar()
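As a side note (my own sketch, not part of the answer above): if the goal from the question is to avoid building an intermediate melted table at all, the same totals can be computed by multiplying a usable-day mask by the discount and summing each day column, assuming the same days list:
# Vectorized alternative without melt: eq(1) is True only where the day value
# is exactly 1, so NaN and 2 are both excluded; each column is then weighted
# by DISCOUNT_PRICE and summed.
s2 = (coupon_list[days].eq(1)
          .mul(coupon_list['DISCOUNT_PRICE'], axis=0)
          .sum())

s2.plot.bar()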
Upvotes: 3