Joe Stuart

Reputation: 33

Pandas Groupby / List to Multiple Rows

In this example I have 7 columns per row. I group by AccountID and Last Name; together they identify the same person, and each row's Contract, Address, City, and State values represent a different location for that AccountID/Last Name.

I would like AccountID/Last Name on one line alongside one or more sets of Contract, Address, City, and State.

Current data looks like this:

Contract AccountID Last Name First Name Address City State
622 1234 Pitt Brad 466 7th Ave Park Slope NY
28974 1234 Pitt Brad 1901 Vine Street Philadelphia PA
54122 4321 Ford Henry 93 Booth Dr Nutley NJ
622 2345 Rhodes Dusty 1 Public Library Plaze Stamford CT
28974 2345 Rhodes Dusty 1001 Kings Highway Cherry Hill NJ
54122 2345 Rhodes Dusty 444 Amsterdamn Ave Upper West Side NY

Would like to display the data like this:

AccountID Last Name First Name Contract_1 Address_1 City_1 State_1 Contract_2 Address_2 City_2 State_2 Contract_3 Address_3 City_3 State_3
1234 Pitt Brad 622 466 7th Ave Park Slope NY 28974.0 1901 Vine Street Philadelphia PA
4321 Ford Henry 54122 93 Booth Dr Nutley NJ
2345 Rhodes Dusty 622 1 Public Library Plaze Stamford CT 28974.0 1001 Kings Highway Cherry Hill NJ 54122.0 444 Amsterdamn Ave Upper West Side NY
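
For reproducibility, the sample input can also be built directly as a DataFrame (an alternative to the read_clipboard call in Step 1 below; the values are copied from the current-data table above):

import pandas as pd

df = pd.DataFrame({
    'Contract':   [622, 28974, 54122, 622, 28974, 54122],
    'AccountID':  [1234, 1234, 4321, 2345, 2345, 2345],
    'Last Name':  ['Pitt', 'Pitt', 'Ford', 'Rhodes', 'Rhodes', 'Rhodes'],
    'First Name': ['Brad', 'Brad', 'Henry', 'Dusty', 'Dusty', 'Dusty'],
    'Address':    ['466 7th Ave', '1901 Vine Street', '93 Booth Dr',
                   '1 Public Library Plaze', '1001 Kings Highway', '444 Amsterdamn Ave'],
    'City':       ['Park Slope', 'Philadelphia', 'Nutley',
                   'Stamford', 'Cherry Hill', 'Upper West Side'],
    'State':      ['NY', 'PA', 'NJ', 'CT', 'NJ', 'NY'],
})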

Here is what I've done so far. I have been reworking Step 5 onward for a week. Any suggestions?

# Step 1
import pandas as pd
import numpy as np
# read from "my clipboard"
df = pd.read_clipboard()
df

# Step 2 - pack the four location fields into one list per row
df['Contract_State'] = (df['Contract'].astype(str) + '|' + df['Address'] + '|' + df['City'] + '|' + df['State']).str.split('|')
df['Contract'] = df['Contract'].astype(str)
df['AccountID'] = df['AccountID'].astype(str)

# Step 3 - groupby
df2 = pd.DataFrame(df.groupby(['AccountID', 'Last Name']).Contract_State.apply(list)).reset_index()
df2

# Step 4 - flatten the lists
df2['Contract_State'] = df2['Contract_State'].apply(lambda x: np.array(x).flatten())
df2

# Step 5 - find the largest number of elements across the flattened lists
num_columns = df2['Contract_State'].apply(len).max()
num_columns 

# Step 6 - expand the lists into columns ('columns' is the list of names I haven't worked out yet)
df3 = pd.DataFrame(list(df2['Contract_State']), columns=columns)
df3

# Step 7 - concatenate df2 with contracts, then drop the column "Contract_State"
df4 = pd.concat([df2, df3], join='inner', axis='columns').drop('Contract_State', axis='columns')
df4
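
To make my intent for Steps 5-7 concrete, here is one sketch of how they could be completed, assuming every location always contributes exactly the four fields Contract, Address, City, State, so the column names only need numeric suffixes:

# assumes df2 and num_columns from Steps 3-5 and the imports from Step 1
fields = ['Contract', 'Address', 'City', 'State']
num_locations = num_columns // len(fields)
columns = [f'{f}_{i}' for i in range(1, num_locations + 1) for f in fields]

# pad each flattened list with NaN so every row has num_columns entries
padded = df2['Contract_State'].apply(lambda a: list(a) + [np.nan] * (num_columns - len(a)))
df3 = pd.DataFrame(list(padded), columns=columns)

Step 7 then runs as written, but I suspect there is a cleaner way.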

Upvotes: 3

Views: 1786

Answers (4)

Henry Ecker

Reputation: 35636

We can pass a Series directly to pivot_table as the columns argument with aggfunc='first'. Use groupby cumcount to enumerate the rows within each group; the counts become the new column suffixes:

cols = ['AccountID', 'Last Name', 'First Name']
dfp = (
    df.pivot_table(
        index=cols,
        columns=df.groupby(cols).cumcount() + 1,
        aggfunc='first'
    ).sort_index(axis=1, level=1, sort_remaining=False)
)
# Collapse Multi-Index
dfp.columns = dfp.columns.map(lambda t: '_'.join(map(str, t)))
dfp = dfp.reset_index()

Or with set_index + unstack, without needing the groupby/'first' aggregation, since cumcount ensures the unstacked columns are unique:

cols = ['AccountID', 'Last Name', 'First Name']
dfw = df.set_index(
    [*cols, df.groupby(cols).cumcount() + 1]
).unstack().sort_index(axis=1, level=1, sort_remaining=False)
# Collapse Multi-Index
dfw.columns = dfw.columns.map(lambda t: '_'.join(map(str, t)))
dfw = dfw.reset_index()

Either option produces:

   AccountID Last Name First Name               Address_1      City_1  Contract_1 State_1           Address_2        City_2  Contract_2 State_2           Address_3           City_3  Contract_3 State_3
0       1234      Pitt       Brad             466 7th Ave  Park Slope       622.0      NY    1901 Vine Street  Philadelphia     28974.0      PA                 NaN              NaN         NaN     NaN
1       2345    Rhodes      Dusty  1 Public Library Plaze    Stamford       622.0      CT  1001 Kings Highway   Cherry Hill     28974.0      NJ  444 Amsterdamn Ave  Upper West Side     54122.0      NY
2       4321      Ford      Henry             93 Booth Dr      Nutley     54122.0      NJ                 NaN           NaN         NaN     NaN                 NaN              NaN         NaN     NaN
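
If the Contract/Address/City/State ordering from the question is preferred over the alphabetical ordering within each numbered group, the columns of dfp can be re-ordered afterwards. A small sketch (assuming, as in the output above, that every field gets a suffix for every group number):

value_cols = ['Contract', 'Address', 'City', 'State']
n_groups = df.groupby(cols).cumcount().max() + 1  # 3 for this sample
ordered = cols + [f'{v}_{i}' for i in range(1, n_groups + 1) for v in value_cols]
dfp = dfp[ordered]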

The pyjanitor module has an abstraction for this operation, pivot_wider, which hides away the collapsing of the MultiIndex and the restoration of the index columns:

# pip install pyjanitor
# conda install pyjanitor -c conda-forge
import janitor
import pandas as pd


cols = ['AccountID', 'Last Name', 'First Name']
dfw = (
    df.add_column(
        'group', df.groupby(cols).cumcount() + 1
    ).pivot_wider(
        index=cols,
        names_from='group'
    )
)

dfw:
   AccountID Last Name First Name  Contract_1  Contract_2  Contract_3               Address_1           Address_2           Address_3      City_1        City_2           City_3 State_1 State_2 State_3
0       1234      Pitt       Brad       622.0     28974.0         NaN             466 7th Ave    1901 Vine Street                 NaN  Park Slope  Philadelphia              NaN      NY      PA     NaN
1       2345    Rhodes      Dusty       622.0     28974.0     54122.0  1 Public Library Plaze  1001 Kings Highway  444 Amsterdamn Ave    Stamford   Cherry Hill  Upper West Side      CT      NJ      NY
2       4321      Ford      Henry     54122.0         NaN         NaN             93 Booth Dr                 NaN                 NaN      Nutley           NaN              NaN      NJ     NaN     NaN

There is also an abstraction for collapsing the MultiIndex, janitor.collapse_levels, which can be combined with the pandas operations above for a cleaner result without giving up the flexibility offered by pivot_table and sort_index:

cols = ['AccountID', 'Last Name', 'First Name']
dfp = (
    df.pivot_table(
        index=cols,
        columns=df.groupby(cols).cumcount() + 1,
        aggfunc='first'
    ).sort_index(
        axis=1, level=1, sort_remaining=False
    ).collapse_levels(sep='_').reset_index()
)

dfp:

   AccountID Last Name First Name               Address_1      City_1  Contract_1 State_1           Address_2        City_2  Contract_2 State_2           Address_3           City_3  Contract_3 State_3
0       1234      Pitt       Brad             466 7th Ave  Park Slope       622.0      NY    1901 Vine Street  Philadelphia     28974.0      PA                 NaN              NaN         NaN     NaN
1       2345    Rhodes      Dusty  1 Public Library Plaze    Stamford       622.0      CT  1001 Kings Highway   Cherry Hill     28974.0      NJ  444 Amsterdamn Ave  Upper West Side     54122.0      NY
2       4321      Ford      Henry             93 Booth Dr      Nutley     54122.0      NJ                 NaN           NaN         NaN     NaN                 NaN              NaN         NaN     NaN

Upvotes: 5

Pygirl

Reputation: 13349

You can try using groupby and unstack:

grp_col = ['AccountID', 'Last Name', 'First Name']
df['num'] = df.groupby(grp_col).cumcount()+1
res = df.set_index([*grp_col, 'num']).unstack('num').sort_index(axis=1, level=1).reset_index()
res.columns = res.columns.map(lambda x: f"{x[0]}{x[1]}")

res:

AccountID Last Name First Name Contract1 Address1 City1 State1 Contract2 Address2 City2 State2 Contract3 Address3 City3 State3
0 1234 Pitt Brad 622.0 466 7th Ave Park Slope NY 28974.0 1901 Vine Street Philadelphia PA NaN NaN NaN NaN
1 2345 Rhodes Dusty 622.0 1 Public Library Plaze Stamford CT 28974.0 1001 Kings Highway Cherry Hill NJ 54122.0 444 Amsterdamn Ave Upper West Side NY
2 4321 Ford Henry 54122.0 93 Booth Dr Nutley NJ NaN NaN NaN NaN NaN NaN NaN NaN

Upvotes: 4

not_speshal

Reputation: 23146

Try with groupby and pivot_table:

df["group"] = df.groupby(["AccountID", "Last Name", "First Name"]).cumcount()+1
output = df.pivot_table(index=["AccountID", "Last Name", "First Name"],
                        columns="group",
                        values=["Contract", "Address", "City", "State"],
                        aggfunc="first")
output = output.sort_index(axis=1, level=1)
output.columns = [f"{i}_{j}" for i, j in output.columns]
output = output.reset_index()

Upvotes: 4

Scott Boston

Reputation: 153460

IIUC, I think you can do it like this:

dfg = df.groupby(['AccountID', 'Last Name', df.groupby(['AccountID', 'Last Name']).cumcount() + 1]).first().unstack()
dfg.columns = [f'{i}{j}' for i, j in dfg.columns]
df_out = dfg.sort_index(axis=1, key=lambda x: x.str[-1])
df_out.reset_index()

Output:

   AccountID Last Name  Contract1 First Name1                Address1       City1 State1  Contract2 First Name2            Address2         City2 State2  Contract3 First Name3            Address3            City3 State3
0       1234      Pitt      622.0        Brad             466 7th Ave  Park Slope     NY    28974.0        Brad    1901 Vine Street  Philadelphia     PA        NaN         NaN                 NaN              NaN    NaN
1       2345    Rhodes      622.0       Dusty  1 Public Library Plaze    Stamford     CT    28974.0       Dusty  1001 Kings Highway   Cherry Hill     NJ    54122.0       Dusty  444 Amsterdamn Ave  Upper West Side     NY
2       4321      Ford    54122.0       Henry             93 Booth Dr      Nutley     NJ        NaN         NaN                 NaN           NaN    NaN        NaN         NaN                 NaN              NaN    NaN
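
Note that First Name is repeated per numbered group here because it is treated as a value column. A small variant (a sketch, not part of the code above) adds it to the group keys so it stays a single column, as in the other answers:

keys = ['AccountID', 'Last Name', 'First Name']
dfg = df.groupby([*keys, df.groupby(keys).cumcount() + 1]).first().unstack()
dfg.columns = [f'{i}{j}' for i, j in dfg.columns]
df_out = dfg.sort_index(axis=1, key=lambda x: x.str[-1]).reset_index()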

Upvotes: 7
