Reputation: 391
I have data like below:
id movie details value
5 cane1 good 6
5 wind2 ok 30.3
5 wind1 ok 18
5 cane1 good 2
5 cane22 ok 4
5 cane34 good 7
5 wind2 ok 2
I want the output with below criteria:
If movie name starts with 'cane' - sum the value
If movie name starts with 'wind' - count the occurrence.
So - the final output will be:
id movie value
5 cane1 8
5 cane22 4
5 cane34 7
5 wind1 1
5 wind2 2
I tried to use:
movie_df.groupby(['id']).apply(aggr)
def aggr(x):
if x['movie'].str.startswith('cane'):
y = x.groupby(['value']).sum()
else:
y = x.groupby(['movie']).count()
return y
But It's not working. Can anyone please help?
Upvotes: 2
Views: 120
Reputation: 164693
You should aim for vectorised operations where possible.
You can calculate 2 results and then concatenate them.
mask = df['movie'].str.startswith('cane')
df1 = df[mask].groupby('movie')['value'].sum()
df2 = df[~mask].groupby('movie').size()
res = pd.concat([df1, df2], ignore_index=0)\
.rename('value').reset_index()
print(res)
movie value
0 cane1 8.0
1 cane22 4.0
2 cane34 7.0
3 wind1 1.0
4 wind2 2.0
Upvotes: 2
Reputation: 1
I would start by creating a column which defines the required groups. For the example at hand this can be done with
df['group'] = df.movie.transform(lambda x : x[:4])
The next step would be to group by this column
df.groupby('group').apply(agg_fun)
using the following aggregation function
def agg_fun(grp):
if grp.name == "cane":
value=grp.value.sum()
else:
value=grp.value.count()
return value
The output of this code is
group
cane 19.0
wind 3.0
Upvotes: 0
Reputation: 96
There might be multiple ways of doing this. One way would to filter by the start of movie name first and then aggregate and merge afterwards.
cane = movie_df[movie_df['movie'].str.startswith('cane1')]
wind = movie_df[movie_df['movie'].str.startswith('wind')]
cane_sum = cane.groupby(['id']).agg({'movie':'first', 'value':'sum'}).reset_index()
wind_count = wind.groupby(['id']).agg({'movie':'first', 'value':'count'}).reset_index()
pd.concat([cane_sum, wind_count])
Upvotes: 2
Reputation: 919
First of all, you need to perform string operation. I guess in your case you don't want digits in a movie name. Use solution discussed at pandas applying regex to replace values. Then you call groupby() on new series.
FYI: Some movie names have digits only; in that case, you need to use update function. https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.update.html
Upvotes: 0