Jennifer

Reputation: 165

Python assign new row names to rows created by groupby

In my dataset, I summed values across columns using groupby, and attached the new summed values to the original dataframe like below.

import pandas as pd
data = {'Person':['A','B','C','D','E','F'], 
    'NY':[1, 1, 0, 3, 0, 0],
    'LA':[0, 1, 1, 0, 1, 1],
    'CHI':[2, 0, 1, 0, 0, 1],
    'Gender':['F','F','F','M','M','NA'],
    'Citizenship':['US','Canada','US','US','Mexico','Canada'],
    'Age':['30s','30s','50s','NA','20s','30s']} 
df = pd.DataFrame(data) 
n_by_gender = df.groupby(['Gender']).sum()
n_by_citizenship = df.groupby(['Citizenship']).sum()
n_by_age = df.groupby(['Age']).sum()
df_new = pd.concat([df, n_by_gender, n_by_citizenship, n_by_age])

I noticed that when I concatenate these dataframes, pandas automatically creates an index based on the groups I assigned, but those index values aren't actual row names (see the image below, from a Jupyter notebook). Instead of hard-coding row names, is there a way to assign row names based on the index? (e.g. the 7th, 8th and 9th rows would be named "Gender_F", "Gender_M", "Gender_NA", the 10th and 11th rows would be named "Citizenship_Canada", "Citizenship_Mexico", ...) Would it be easier to create a new column named ID or something to store this information, instead of storing it as row names?
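For reference, one possible approach (a minimal sketch of my own, not code from the question) is to prefix each group's index with the name of its grouping column before concatenating, which produces row names of exactly the kind described above:

```python
import pandas as pd

data = {'Person': ['A', 'B', 'C', 'D', 'E', 'F'],
        'NY':  [1, 1, 0, 3, 0, 0],
        'LA':  [0, 1, 1, 0, 1, 1],
        'CHI': [2, 0, 1, 0, 0, 1],
        'Gender': ['F', 'F', 'F', 'M', 'M', 'NA'],
        'Citizenship': ['US', 'Canada', 'US', 'US', 'Mexico', 'Canada'],
        'Age': ['30s', '30s', '50s', 'NA', '20s', '30s']}
df = pd.DataFrame(data)

parts = []
for col in ['Gender', 'Citizenship', 'Age']:
    # sum only the city columns, then prefix the group labels
    g = df.groupby(col)[['NY', 'LA', 'CHI']].sum()
    g.index = col + '_' + g.index      # e.g. 'Gender_F', 'Citizenship_US'
    parts.append(g)

summary = pd.concat(parts)
print(summary.index.tolist())
```

This keeps the labels as row names; whether that is better than a dedicated ID column depends on how the result will be used downstream.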

Edit: final dataframe would look like this:

final = {'rownames':['NY','LA','CHI'],
         'Gender_F':[2,1,0],
         'Gender_M':[3,1,0],
         'Gender_NA':[0,1,1],
         'Citizenship_Canada':[1,2,1],
         'Citizenship_Mexico':[0,1,0],
         'Citizenship_US':[4,1,3],
         'Age_20s':[0,1,0],
         'Age_30s':[2,2,3],
         'Age_50s':[0,1,1],
         'Age_NA':[3,0,0]}
finaldf = pd.DataFrame(final)
finaldf

Upvotes: 1

Views: 671

Answers (1)

David Erickson

Reputation: 16683

In summary: use .melt to unpivot the dataframe into long format, pd.get_dummies to create dummy-variable columns, np.where to fill those dummies with the melted values, and .groupby to aggregate.

  1. First, .melt the dataframe, passing to the id_vars parameter all of the columns that you are NOT eventually grouping by — that is, every column other than 'NY', 'LA' and 'CHI'. A 'variable' column is automatically created with a corresponding 'value' column, transforming the data from a wide (pivoted) format to a long (unpivoted) format.
  2. Second, use pd.get_dummies() and pass the columns that you want dummy variables for, specifying the dtype as 'int' in preparation for the next step. Alternatively, you can rely on the default dtype of pd.get_dummies() (np.uint8 in older pandas versions, bool in newer ones), but I find it easier to just pass dtype='int'.
  3. In the third step, I loop through the relevant columns (all columns that are of dtype 'int' except the 'value' column) and use np.where() to apply the 'value' column to any cells that returned a 1 when we did pd.get_dummies() earlier.
  4. Finally, you are ready to do a .groupby() of the 'variable' column, summarizing the data by the three cities ('NY', 'LA', 'CHI').

    import pandas as pd, numpy as np
    df1 = df.melt(id_vars=['Person', 'Gender', 'Citizenship', 'Age'])
    a = pd.get_dummies(df1, columns=['Gender', 'Citizenship', 'Age'], dtype='int')
    for col in a.columns.to_list():
        if col != 'value' and a[col].dtype == 'int':  # compare to the string 'value', not a list
            a[col] = np.where(a[col] == 1, a['value'], a[col])
    b = (a.groupby('variable').sum(numeric_only=True)  # numeric_only drops the 'Person' strings
           .reset_index()
           .rename({'variable': 'rownames'}, axis=1)
           .drop('value', axis=1))
    b
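
As a quick sanity check (my addition, not part of the original answer), running the pipeline end to end and inspecting the 'NY' row reproduces the counts from the expected finaldf in the question:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Person': ['A', 'B', 'C', 'D', 'E', 'F'],
                   'NY':  [1, 1, 0, 3, 0, 0],
                   'LA':  [0, 1, 1, 0, 1, 1],
                   'CHI': [2, 0, 1, 0, 0, 1],
                   'Gender': ['F', 'F', 'F', 'M', 'M', 'NA'],
                   'Citizenship': ['US', 'Canada', 'US', 'US', 'Mexico', 'Canada'],
                   'Age': ['30s', '30s', '50s', 'NA', '20s', '30s']})

# unpivot, create int dummies, replace the 1s with the melted city counts
df1 = df.melt(id_vars=['Person', 'Gender', 'Citizenship', 'Age'])
a = pd.get_dummies(df1, columns=['Gender', 'Citizenship', 'Age'], dtype='int')
for col in a.columns.to_list():
    if col != 'value' and a[col].dtype == 'int':
        a[col] = np.where(a[col] == 1, a['value'], a[col])
b = (a.groupby('variable').sum(numeric_only=True)
      .reset_index()
      .rename({'variable': 'rownames'}, axis=1)
      .drop('value', axis=1))

# the 'NY' row should match the question's finaldf: Gender_F=2, Gender_M=3, Gender_NA=0
ny = b.set_index('rownames').loc['NY']
print(int(ny['Gender_F']), int(ny['Gender_M']), int(ny['Gender_NA']))  # 2 3 0
```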
    

Upvotes: 2
