Reputation: 391

panda aggregate by functions

I have data like below:

id  movie   details value
5   cane1   good    6
5   wind2   ok  30.3
5   wind1   ok  18
5   cane1   good    2
5   cane22  ok  4
5   cane34  good    7
5   wind2   ok  2

I want the output with below criteria:

If movie name starts with 'cane' - sum the value

If movie name starts with 'wind' - count the occurrence.

So - the final output will be:

id  movie   value
5   cane1   8
5   cane22  4
5   cane34  7
5   wind1   1
5   wind2   2

I tried to use:

movie_df.groupby(['id']).apply(aggr)

def aggr(x):
    if x['movie'].str.startswith('cane'):
        y = x.groupby(['value']).sum()

    else:
         y = x.groupby(['movie']).count()

    return y

But It's not working. Can anyone please help?

Upvotes: 2

Answers (4)

jpp

Reputation: 164693

You should aim for vectorised operations where possible.

You can calculate 2 results and then concatenate them.

mask = df['movie'].str.startswith('cane')

df1 = df[mask].groupby('movie')['value'].sum()
df2 = df[~mask].groupby('movie').size()

res = pd.concat([df1, df2], ignore_index=0)\
        .rename('value').reset_index()

print(res)

    movie  value
0   cane1    8.0
1  cane22    4.0
2  cane34    7.0
3   wind1    1.0
4   wind2    2.0

Upvotes: 2

Christian

Reputation: 1

I would start by creating a column which defines the required groups. For the example at hand this can be done with

df['group'] = df.movie.transform(lambda x : x[:4])

The next step would be to group by this column

df.groupby('group').apply(agg_fun)

using the following aggregation function

def agg_fun(grp):
    if grp.name == "cane":
        value=grp.value.sum()
    else:
        value=grp.value.count()
    return value

The output of this code is

group
cane    19.0
wind     3.0

Upvotes: 0

Retiefasaurus

Reputation: 96

There might be multiple ways of doing this. One way would to filter by the start of movie name first and then aggregate and merge afterwards.

cane = movie_df[movie_df['movie'].str.startswith('cane1')]
wind = movie_df[movie_df['movie'].str.startswith('wind')]

cane_sum = cane.groupby(['id']).agg({'movie':'first', 'value':'sum'}).reset_index()
wind_count = wind.groupby(['id']).agg({'movie':'first', 'value':'count'}).reset_index()

pd.concat([cane_sum, wind_count])

Upvotes: 2

JR ibkr

Reputation: 919

First of all, you need to perform string operation. I guess in your case you don't want digits in a movie name. Use solution discussed at pandas applying regex to replace values. Then you call groupby() on new series.

FYI: Some movie names have digits only; in that case, you need to use update function. https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.update.html

Upvotes: 0

panda aggregate by functions

Answers (4)

Related Questions