Dmitrij Burlaj
Dmitrij Burlaj

Reputation: 55

How to keep indexes when sum by columns based on grouped_by in pandas

I have a dataset where each ID has 6 corresponding rows. I want to this dataset grouped by the column ID and sum aggregate using sum. I wrote this piece of code:

col = [col for col in train.columns if col not in ['Month', 'ID']]
train.groupby('ID')[col].sum().reset_index()

Everything works fine except that I lose column ID. Now, Unique ID from my initial database disappeared and instead I have just enumerated ids from 0 up to the number of rows in the resulting dataset. I want to keep initial indexes, because I will need to merge this dataset with another further. How I can deal with this problem? Thanks for helping very much!

P.S: deleting reset_index() has no effect

P.S: You can see two problems on the images. On first image there is original database. You can see 6 entries for each ID. On the second image there is a databased which is a result from the grouped statement. First problem: IDs are not the same as in the original table. Second problem: the sum over 6 months for each ID is not correct.

Image1 Image2

Upvotes: 1

Views: 6672

Answers (2)

FabienP
FabienP

Reputation: 3138

Instead of using reset_index() you can simply use the keyword argument as_index: df.groupby('ID', as_index=False) This will preserve column ID in the resulting DataFrameGroupBy, as described in groupby()'s doc.

as_index : boolean, default True

For aggregated output, return object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output

Upvotes: 2

Stefan Falk
Stefan Falk

Reputation: 25367

When you group a data frame by some columns, those columns become your new index.

import pandas as pd
import numpy as np

# Create data

n = 6; m = 3
col_id = np.hstack([['id-'+str(i)] * n for i in range(m)]).reshape(-1, 1)
np.random.shuffle(col_id)
data = np.random.rand(m*n, m) 
columns = ['v'+str(i+1) for i in range(m)]
df = pd.DataFrame(data, columns=columns)
df['ID'] = col_id

# Group by ID

print(df.groupby('ID').sum())

Will simply give you

            v1        v2        v3
ID                                
id-0  2.099219  2.708839  2.766141
id-1  2.554117  2.183166  3.914883
id-2  2.485505  2.739834  2.250873

If you just want the column ID back, you just have to reset_index()

print(df.groupby('ID').sum().reset_index())

which will leave you with

     ID        v1        v2        v3
0  id-0  2.099219  2.708839  2.766141
1  id-1  2.554117  2.183166  3.914883
2  id-2  2.485505  2.739834  2.250873

Note:

groupby will sort the resulting DataFrame by its index. If you don't want that for any reason just set sorted=False (see also the documentation)

print(df.groupby('ID', sorted=false).sum())

Upvotes: 1

Related Questions