In my dataset, I summed values across columns using groupby, and attached the new summed values to the original dataframe like below.
import pandas as pd

data = {'Person': ['A', 'B', 'C', 'D', 'E', 'F'],
        'NY': [1, 1, 0, 3, 0, 0],
        'LA': [0, 1, 1, 0, 1, 1],
        'CHI': [2, 0, 1, 0, 0, 1],
        'Gender': ['F', 'F', 'F', 'M', 'M', 'NA'],
        'Citizenship': ['US', 'Canada', 'US', 'US', 'Mexico', 'Canada'],
        'Age': ['30s', '30s', '50s', 'NA', '20s', '30s']}
df = pd.DataFrame(data)

# numeric_only=True restricts the sums to the city columns
# (needed on newer pandas, where .sum() over string columns raises)
n_by_gender = df.groupby(['Gender']).sum(numeric_only=True)
n_by_citizenship = df.groupby(['Citizenship']).sum(numeric_only=True)
n_by_age = df.groupby(['Age']).sum(numeric_only=True)
df_new = pd.concat([df, n_by_gender, n_by_citizenship, n_by_age])
I noticed that when I concatenate these dataframes, pandas automatically creates an index based on the groups I assigned, but these aren't actual row names (like the image below, from a Jupyter notebook). Instead of hard-coding row names, is there a way to assign them based on the index? For example, the 7th, 8th and 9th rows would be named "Gender_F", "Gender_M" and "Gender_NA", the 10th and 11th rows "Citizenship_Canada" and "Citizenship_Mexico", and so on. Or would it be easier to create a new column named ID (or similar) to store this information, instead of keeping it in the row names?
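One possible sketch of the row-naming idea (not necessarily the best approach): build each group sum separately, concatenate them via a dict so the dict keys become an outer index level, then join the two index levels into names like "Gender_F". This assumes only the three city columns need to be summed:

```python
import pandas as pd

data = {'Person': ['A', 'B', 'C', 'D', 'E', 'F'],
        'NY': [1, 1, 0, 3, 0, 0],
        'LA': [0, 1, 1, 0, 1, 1],
        'CHI': [2, 0, 1, 0, 0, 1],
        'Gender': ['F', 'F', 'F', 'M', 'M', 'NA'],
        'Citizenship': ['US', 'Canada', 'US', 'US', 'Mexico', 'Canada'],
        'Age': ['30s', '30s', '50s', 'NA', '20s', '30s']}
df = pd.DataFrame(data)

cities = ['NY', 'LA', 'CHI']
# One summed frame per grouping column; the dict keys become the outer index level
parts = {col: df.groupby(col)[cities].sum()
         for col in ['Gender', 'Citizenship', 'Age']}
stacked = pd.concat(parts)

# Flatten the ('Gender', 'F')-style MultiIndex into 'Gender_F'-style row names
stacked.index = [f'{group}_{level}' for group, level in stacked.index]
stacked      # rows: Gender_F, ..., Age_NA; columns: NY, LA, CHI
stacked.T    # or transpose to get the cities as rows
```

Transposing with `stacked.T` gives the cities as rows and the "Gender_F" names as columns, which is close to the final layout shown in the edit below, just with the names in the index rather than a separate column.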
Edit: the final dataframe would look like this:

final = {'rownames': ['NY', 'LA', 'CHI'],
         'Gender_F': [2, 2, 3],
         'Gender_M': [3, 1, 0],
         'Gender_NA': [0, 1, 1],
         'Citizenship_Canada': [1, 2, 1],
         'Citizenship_Mexico': [0, 1, 0],
         'Citizenship_US': [4, 1, 3],
         'Age_20s': [0, 1, 0],
         'Age_30s': [2, 2, 3],
         'Age_50s': [0, 1, 1],
         'Age_NA': [3, 0, 0]}
finaldf = pd.DataFrame(final)
finaldf
In summary: use .melt to unpivot the dataframe into a long format, pd.get_dummies to create dummy variable columns, np.where to fill them in, and .groupby to aggregate.

1. .melt the dataframe, passing every column that you are NOT eventually grouping by to the id_vars parameter — that is, all columns other than 'NY', 'LA' and 'CHI'. A 'variable' column is automatically created with a corresponding 'value' column, transforming the data from a wide format (i.e. pivoted) to a long format (i.e. unpivoted).
2. Call pd.get_dummies() on the columns you want dummy variables for, specifying dtype='int' in preparation for the next step. Alternatively, you can use the default dtype for pd.get_dummies(), which is np.uint8 (bool in recent pandas versions), but I find it easier to just pass dtype='int'.
3. Use np.where() to write the 'value' column into any cells that pd.get_dummies() set to 1.
4. Finally, you are ready to .groupby() the 'variable' column, summing the data for the three cities ('NY', 'LA', 'CHI').
import numpy as np
import pandas as pd

# df is the dataframe from the question
df1 = df.melt(id_vars=['Person', 'Gender', 'Citizenship', 'Age'])
a = pd.get_dummies(df1, columns=['Gender', 'Citizenship', 'Age'], dtype='int')
for col in a.columns:
    if col != 'value' and a[col].dtype == 'int':
        a[col] = np.where(a[col] == 1, a['value'], a[col])
b = (a.groupby('variable')
      .sum(numeric_only=True)  # drops the non-numeric 'Person' column
      .reset_index()
      .rename({'variable': 'rownames'}, axis=1)
      .drop('value', axis=1))
b
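As a quick sanity check (a self-contained sketch that recreates the question's data and reruns the pipeline above), a few cells of the result can be verified directly against the raw counts:

```python
import numpy as np
import pandas as pd

# Recreate the question's dataframe
data = {'Person': ['A', 'B', 'C', 'D', 'E', 'F'],
        'NY': [1, 1, 0, 3, 0, 0],
        'LA': [0, 1, 1, 0, 1, 1],
        'CHI': [2, 0, 1, 0, 0, 1],
        'Gender': ['F', 'F', 'F', 'M', 'M', 'NA'],
        'Citizenship': ['US', 'Canada', 'US', 'US', 'Mexico', 'Canada'],
        'Age': ['30s', '30s', '50s', 'NA', '20s', '30s']}
df = pd.DataFrame(data)

# melt -> dummies -> np.where -> groupby, as above
df1 = df.melt(id_vars=['Person', 'Gender', 'Citizenship', 'Age'])
a = pd.get_dummies(df1, columns=['Gender', 'Citizenship', 'Age'], dtype='int')
for col in a.columns:
    if col != 'value' and a[col].dtype == 'int':
        a[col] = np.where(a[col] == 1, a['value'], a[col])
b = (a.groupby('variable').sum(numeric_only=True)
      .reset_index()
      .rename({'variable': 'rownames'}, axis=1)
      .drop('value', axis=1))

# Spot-check a few cells against the raw data
check = b.set_index('rownames')
print(check.loc['NY', 'Gender_M'])        # persons D and E: 3 + 0 = 3
print(check.loc['NY', 'Citizenship_US'])  # persons A, C, D: 1 + 0 + 3 = 4
```

Note that after the groupby, the rows come out in alphabetical order ('CHI', 'LA', 'NY'); reindex on 'rownames' if you want the 'NY', 'LA', 'CHI' order from the question's example.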