Reputation: 55
I have a dataset where each ID
has 6 corresponding rows. I want to this dataset grouped by the column ID
and sum aggregate using sum. I wrote this piece of code:
col = [col for col in train.columns if col not in ['Month', 'ID']]
train.groupby('ID')[col].sum().reset_index()
Everything works fine except that I lose column ID. Now, Unique ID from my initial database disappeared and instead I have just enumerated ids from 0 up to the number of rows in the resulting dataset. I want to keep initial indexes, because I will need to merge this dataset with another further. How I can deal with this problem? Thanks for helping very much!
P.S: deleting reset_index() has no effect
P.S: You can see two problems on the images. On first image there is original database. You can see 6 entries for each ID. On the second image there is a databased which is a result from the grouped statement. First problem: IDs are not the same as in the original table. Second problem: the sum over 6 months for each ID is not correct.
Upvotes: 1
Views: 6672
Reputation: 3138
Instead of using reset_index()
you can simply use the keyword argument as_index
: df.groupby('ID', as_index=False)
This will preserve column ID
in the resulting DataFrameGroupBy
, as described in groupby()
's doc.
as_index : boolean, default True
For aggregated output, return object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output
Upvotes: 2
Reputation: 25367
When you group a data frame by some columns, those columns become your new index.
import pandas as pd
import numpy as np
# Create data
n = 6; m = 3
col_id = np.hstack([['id-'+str(i)] * n for i in range(m)]).reshape(-1, 1)
np.random.shuffle(col_id)
data = np.random.rand(m*n, m)
columns = ['v'+str(i+1) for i in range(m)]
df = pd.DataFrame(data, columns=columns)
df['ID'] = col_id
# Group by ID
print(df.groupby('ID').sum())
Will simply give you
v1 v2 v3
ID
id-0 2.099219 2.708839 2.766141
id-1 2.554117 2.183166 3.914883
id-2 2.485505 2.739834 2.250873
If you just want the column ID
back, you just have to reset_index()
print(df.groupby('ID').sum().reset_index())
which will leave you with
ID v1 v2 v3
0 id-0 2.099219 2.708839 2.766141
1 id-1 2.554117 2.183166 3.914883
2 id-2 2.485505 2.739834 2.250873
Note:
groupby
will sort the resulting DataFrame by its index. If you don't want that for any reason just set sorted=False
(see also the documentation)
print(df.groupby('ID', sorted=false).sum())
Upvotes: 1