Ben Muller

Reputation: 331

Pandas Groupby and Apply

I am performing a groupby and apply over a dataframe, and it returns some strange results. I am using pandas 1.3.1.

Here is the code:

import pandas as pd

ddf = pd.DataFrame({
    "id": [1,1,1,1,2]
})

def do_something(df):
    return "x"

ddf["title"] = ddf.groupby("id").apply(do_something)
ddf

I would expect every row in the title column to be assigned the value "x", but instead I get this data:

        id title
0        1   NaN
1        1     x
2        1     x
3        1   NaN
4        2   NaN

Is this expected?

Upvotes: 2

Views: 238

Answers (3)

jezrael

Reputation: 862661

If you need a new column from an aggregate function, use GroupBy.transform. You must specify the column to process after the groupby, here id:

ddf["title"] = ddf.groupby("id")['id'].transform(do_something)

Or assign the new column inside the function:

def do_something(x):
    x['title'] = 'x'
    return x

ddf = ddf.groupby("id").apply(do_something)

Why the original code does not work is explained in the other answers.
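As a minimal sketch of why transform fits here (the frame and do_something are copied from the question; the group_size column is only an illustrative addition): transform returns a result indexed like the original rows, so it can be assigned as a column directly.

```python
import pandas as pd

ddf = pd.DataFrame({"id": [1, 1, 1, 1, 2]})

def do_something(group):
    # Placeholder from the question: one scalar per group.
    return "x"

# transform broadcasts each group's result back to that group's rows,
# so the result is indexed like ddf and can be assigned directly.
title = ddf.groupby("id")["id"].transform(do_something)
ddf["title"] = title

# Same idea with a built-in reduction (illustrative extra column):
# each row receives the size of its group.
ddf["group_size"] = ddf.groupby("id")["id"].transform("size")
```

Because the result shares ddf's index, the assignment does not depend on whether the group keys happen to match any row labels.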

Upvotes: 1

Code_beginner

Reputation: 92

Yes, it is expected.

First of all, the apply(do_something) part works like a charm; it is the groupby right before it that causes the problem. A groupby returns a GroupBy object, which is a little different from a normal dataframe. If you debug and inspect what the groupby returns, you can see that you need some form of summary function to use it (mean, max or sum). If you run one of them, for example like this:

df = ddf.groupby("id")
df.mean()

it leads to this result:

Empty DataFrame
Columns: []
Index: [1, 2]

After that, do_something is applied once per group, so its result has only the index values 1 and 2, and that result is then aligned with your original df on the index. This is why only the rows with index 1 and 2 contain x. For now I would recommend leaving out the groupby, since it is not clear why you want to use it here anyway, and taking a deeper look at the groupby object.
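The alignment described above can be reproduced step by step. This is only a sketch: it selects the "id" column before apply so that it behaves the same across pandas versions, and the names per_group and aligned are made up for illustration.

```python
import pandas as pd

ddf = pd.DataFrame({"id": [1, 1, 1, 1, 2]})

def do_something(group):
    return "x"

# One value per group, indexed by the group keys (1 and 2),
# not by the row labels 0..4.
per_group = ddf.groupby("id")["id"].apply(do_something)

# Column assignment aligns on index labels, which is equivalent to this
# reindex: only the rows labelled 1 and 2 find a matching key.
aligned = per_group.reindex(ddf.index)

# Looking each row's group key up through the "id" column fills every row.
ddf["title"] = ddf["id"].map(per_group)
```

Here aligned has values only at labels 1 and 2, which matches the NaN pattern in the question, while the map version fills all five rows.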

Upvotes: 1

Corralien

Reputation: 120409

The result is not strange, it is the expected behavior: apply returns one value per group, and the group labels (here 1 and 2) become the index of the aggregated result:

>>> list(ddf.groupby("id"))
[(1,        # the group name (the future index of the grouped df)
     id     # the subset dataframe of the group 1
  0   1
  1   1
  2   1
  3   1),
 (2,        # the group name (the future index of the grouped df)
     id     # the subset dataframe of the group 2
  4   2)]

Why do you get any result at all? Because the group labels (1 and 2) also happen to appear in your dataframe's index:

>>> ddf.groupby("id").apply(do_something)
id
1    x
2    x
dtype: object

Now change the id like this:

ddf['id'] += 10
#    id
# 0  11
# 1  11
# 2  11
# 3  11
# 4  12

ddf["title"] = ddf.groupby("id").apply(do_something)
#    id title
# 0  11   NaN
# 1  11   NaN
# 2  11   NaN
# 3  11   NaN
# 4  12   NaN

Or change the index:

ddf.index += 10
#    id
# 10  1
# 11  1
# 12  1
# 13  1
# 14  2

ddf["title"] = ddf.groupby("id").apply(do_something)
#     id title
# 10   1   NaN
# 11   1   NaN
# 12   1   NaN
# 13   1   NaN
# 14   2   NaN
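One way to make the assignment independent of the row labels, sketched here under the assumption that one value per id is what is wanted, is to join the per-group result on the id column instead of on the index (per_group and joined are illustrative names):

```python
import pandas as pd

# Same data as above, with the shifted index 10..14.
ddf = pd.DataFrame({"id": [1, 1, 1, 1, 2]}, index=range(10, 15))

def do_something(group):
    return "x"

# Per-group result indexed by the group keys, renamed to the target column.
per_group = ddf.groupby("id")["id"].apply(do_something).rename("title")

# Join on the values of the "id" column against per_group's index,
# so the row labels of ddf play no role in the matching.
joined = ddf.join(per_group, on="id")
```

joined keeps the original labels 10..14 and every row gets its group's value.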

Upvotes: 2
