Joe Stuart

Reputation: 33

Pandas Groupby / List to Multiple Rows

In this example I have 7 columns per row. I group by AccountID and Last Name; together they identify the same person, and each row's Contract, Address, City, and State values represent a different location for that AccountID/Last Name.

I would like AccountID/Last Name on one line alongside one or more sets of Contract, Address, City, and State.

Current data looks like this:

Contract AccountID Last Name First Name Address City State
622 1234 Pitt Brad 466 7th Ave Park Slope NY
28974 1234 Pitt Brad 1901 Vine Street Philadelphia PA
54122 4321 Ford Henry 93 Booth Dr Nutley NJ
622 2345 Rhodes Dusty 1 Public Library Plaze Stamford CT
28974 2345 Rhodes Dusty 1001 Kings Highway Cherry Hill NJ
54122 2345 Rhodes Dusty 444 Amsterdamn Ave Upper West Side NY

Would like to display the data like this:

AccountID Last Name First Name Contract_1 Address_1 City_1 State_1 Contract_2 Address_2 City_2 State_2 Contract_3 Address_3 City_3 State_3
1234 Pitt Brad 622 466 7th Ave Park Slope NY 28974.0 1901 Vine Street Philadelphia PA
4321 Ford Henry 54122 93 Booth Dr Nutley NJ
2345 Rhodes Dusty 622 1 Public Library Plaze Stamford CT 28974.0 1001 Kings Highway Cherry Hill NJ 54122.0 444 Amsterdamn Ave Upper West Side NY
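
For reproducibility, the sample input can also be built directly as a DataFrame (an alternative to the read_clipboard call in Step 1 below; the values are copied from the current-data table above):

import pandas as pd

df = pd.DataFrame({
    'Contract':   [622, 28974, 54122, 622, 28974, 54122],
    'AccountID':  [1234, 1234, 4321, 2345, 2345, 2345],
    'Last Name':  ['Pitt', 'Pitt', 'Ford', 'Rhodes', 'Rhodes', 'Rhodes'],
    'First Name': ['Brad', 'Brad', 'Henry', 'Dusty', 'Dusty', 'Dusty'],
    'Address':    ['466 7th Ave', '1901 Vine Street', '93 Booth Dr',
                   '1 Public Library Plaze', '1001 Kings Highway', '444 Amsterdamn Ave'],
    'City':       ['Park Slope', 'Philadelphia', 'Nutley',
                   'Stamford', 'Cherry Hill', 'Upper West Side'],
    'State':      ['NY', 'PA', 'NJ', 'CT', 'NJ', 'NY'],
})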

Here is what I've done so far. I have been reworking Step 5 onward for a week. Any suggestions?

# Step 1
import pandas as pd
import numpy as np
# read from "my clipboard"
df = pd.read_clipboard()
df

# Step 2 - pack the four location fields into one list per row
df['Contract_State'] = (df['Contract'].astype(str) + '|' + df['Address'] + '|' + df['City'] + '|' + df['State']).str.split('|')
df['Contract'] = df['Contract'].astype(str)
df['AccountID'] = df['AccountID'].astype(str)

# Step 3 - groupby
df2 = pd.DataFrame(df.groupby(['AccountID', 'Last Name']).Contract_State.apply(list)).reset_index()
df2

# Step 4 - flatten the lists
df2['Contract_State'] = df2['Contract_State'].apply(lambda x: np.array(x).flatten())
df2

# Step 5 - find the largest number of elements across the flattened lists
num_columns = df2['Contract_State'].apply(len).max()
num_columns 

# Step 6 - expand the lists into columns ('columns' is the list of names I haven't worked out yet)
df3 = pd.DataFrame(list(df2['Contract_State']), columns=columns)
df3

# Step 7 - concatenate df2 with contracts, then drop the column "Contract_State"
df4 = pd.concat([df2, df3], join='inner', axis='columns').drop('Contract_State', axis='columns')
df4
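
To make my intent for Steps 5-7 concrete, here is one sketch of how they could be completed, assuming every location always contributes exactly the four fields Contract, Address, City, State, so the column names only need numeric suffixes:

# assumes df2 and num_columns from Steps 3-5 and the imports from Step 1
fields = ['Contract', 'Address', 'City', 'State']
num_locations = num_columns // len(fields)
columns = [f'{f}_{i}' for i in range(1, num_locations + 1) for f in fields]

# pad each flattened list with NaN so every row has num_columns entries
padded = df2['Contract_State'].apply(lambda a: list(a) + [np.nan] * (num_columns - len(a)))
df3 = pd.DataFrame(list(padded), columns=columns)

Step 7 then runs as written, but I suspect there is a cleaner way.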

Upvotes: 3

Views: 1786

Answers (4)

Henry Ecker

Reputation: 35636

We can pass a Series directly to pivot_table as the columns argument with aggfunc='first'. Use groupby cumcount to enumerate the rows within each group; the counts become the new column suffixes:

cols = ['AccountID', 'Last Name', 'First Name']
dfp = (
    df.pivot_table(
        index=cols,
        columns=df.groupby(cols).cumcount() + 1,
        aggfunc='first'
    ).sort_index(axis=1, level=1, sort_remaining=False)
)
# Collapse Multi-Index
dfp.columns = dfp.columns.map(lambda t: '_'.join(map(str, t)))
dfp = dfp.reset_index()

Or with set_index + unstack, without needing the groupby/'first' aggregation, since cumcount ensures the unstacked columns are unique:

cols = ['AccountID', 'Last Name', 'First Name']
dfw = df.set_index(
    [*cols, df.groupby(cols).cumcount() + 1]
).unstack().sort_index(axis=1, level=1, sort_remaining=False)
# Collapse Multi-Index
dfw.columns = dfw.columns.map(lambda t: '_'.join(map(str, t)))
dfw = dfw.reset_index()

Either option produces:

   AccountID Last Name First Name               Address_1      City_1  Contract_1 State_1           Address_2        City_2  Contract_2 State_2           Address_3           City_3  Contract_3 State_3
0       1234      Pitt       Brad             466 7th Ave  Park Slope       622.0      NY    1901 Vine Street  Philadelphia     28974.0      PA                 NaN              NaN         NaN     NaN
1       2345    Rhodes      Dusty  1 Public Library Plaze    Stamford       622.0      CT  1001 Kings Highway   Cherry Hill     28974.0      NJ  444 Amsterdamn Ave  Upper West Side     54122.0      NY
2       4321      Ford      Henry             93 Booth Dr      Nutley     54122.0      NJ                 NaN           NaN         NaN     NaN                 NaN              NaN         NaN     NaN
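
If the Contract/Address/City/State ordering from the question is preferred over the alphabetical ordering within each numbered group, the columns of dfp can be re-ordered afterwards. A small sketch (assuming, as in the output above, that every field gets a suffix for every group number):

value_cols = ['Contract', 'Address', 'City', 'State']
n_groups = df.groupby(cols).cumcount().max() + 1  # 3 for this sample
ordered = cols + [f'{v}_{i}' for i in range(1, n_groups + 1) for v in value_cols]
dfp = dfp[ordered]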

The pyjanitor module has an abstraction for this operation, pivot_wider, which hides away the collapsing of the MultiIndex and the restoration of the index columns:

# pip install pyjanitor
# conda install pyjanitor -c conda-forge
import janitor
import pandas as pd


cols = ['AccountID', 'Last Name', 'First Name']
dfw = (
    df.add_column(
        'group', df.groupby(cols).cumcount() + 1
    ).pivot_wider(
        index=cols,
        names_from='group'
    )
)

dfw:
   AccountID Last Name First Name  Contract_1  Contract_2  Contract_3               Address_1           Address_2           Address_3      City_1        City_2           City_3 State_1 State_2 State_3
0       1234      Pitt       Brad       622.0     28974.0         NaN             466 7th Ave    1901 Vine Street                 NaN  Park Slope  Philadelphia              NaN      NY      PA     NaN
1       2345    Rhodes      Dusty       622.0     28974.0     54122.0  1 Public Library Plaze  1001 Kings Highway  444 Amsterdamn Ave    Stamford   Cherry Hill  Upper West Side      CT      NJ      NY
2       4321      Ford      Henry     54122.0         NaN         NaN             93 Booth Dr                 NaN                 NaN      Nutley           NaN              NaN      NJ     NaN     NaN

There is also an abstraction for collapsing the MultiIndex, janitor.collapse_levels, which can be combined with the pandas operations above for a cleaner result without giving up the flexibility offered by pivot_table and sort_index:

cols = ['AccountID', 'Last Name', 'First Name']
dfp = (
    df.pivot_table(
        index=cols,
        columns=df.groupby(cols).cumcount() + 1,
        aggfunc='first'
    ).sort_index(
        axis=1, level=1, sort_remaining=False
    ).collapse_levels(sep='_').reset_index()
)

dfp:

   AccountID Last Name First Name               Address_1      City_1  Contract_1 State_1           Address_2        City_2  Contract_2 State_2           Address_3           City_3  Contract_3 State_3
0       1234      Pitt       Brad             466 7th Ave  Park Slope       622.0      NY    1901 Vine Street  Philadelphia     28974.0      PA                 NaN              NaN         NaN     NaN
1       2345    Rhodes      Dusty  1 Public Library Plaze    Stamford       622.0      CT  1001 Kings Highway   Cherry Hill     28974.0      NJ  444 Amsterdamn Ave  Upper West Side     54122.0      NY
2       4321      Ford      Henry             93 Booth Dr      Nutley     54122.0      NJ                 NaN           NaN         NaN     NaN                 NaN              NaN         NaN     NaN

Upvotes: 5

Pygirl

Reputation: 13349

You can try using groupby and unstack:

grp_col = ['AccountID', 'Last Name', 'First Name']
df['num'] = df.groupby(grp_col).cumcount()+1
res = df.set_index([*grp_col, 'num']).unstack('num').sort_index(axis=1, level=1).reset_index()
res.columns = res.columns.map(lambda x: f"{x[0]}{x[1]}")

res:

AccountID Last Name First Name Contract1 Address1 City1 State1 Contract2 Address2 City2 State2 Contract3 Address3 City3 State3
0 1234 Pitt Brad 622.0 466 7th Ave Park Slope NY 28974.0 1901 Vine Street Philadelphia PA NaN NaN NaN NaN
1 2345 Rhodes Dusty 622.0 1 Public Library Plaze Stamford CT 28974.0 1001 Kings Highway Cherry Hill NJ 54122.0 444 Amsterdamn Ave Upper West Side NY
2 4321 Ford Henry 54122.0 93 Booth Dr Nutley NJ NaN NaN NaN NaN NaN NaN NaN NaN

Upvotes: 4

not_speshal

Reputation: 23146

Try with groupby and pivot_table:

df["group"] = df.groupby(["AccountID", "Last Name", "First Name"]).cumcount()+1
output = df.pivot_table(index=["AccountID", "Last Name", "First Name"],
                        columns="group",
                        values=["Contract", "Address", "City", "State"],
                        aggfunc="first")
output = output.sort_index(axis=1, level=1)
output.columns = [f"{i}_{j}" for i, j in output.columns]
output = output.reset_index()

Upvotes: 4

Scott Boston

Reputation: 153460

IIUC, I think you can do it like this:

dfg = df.groupby(['AccountID', 'Last Name', df.groupby(['AccountID', 'Last Name']).cumcount() + 1]).first().unstack()
dfg.columns = [f'{i}{j}' for i, j in dfg.columns]
df_out = dfg.sort_index(axis=1, key=lambda x: x.str[-1])
df_out.reset_index()

Output:

   AccountID Last Name  Contract1 First Name1                Address1       City1 State1  Contract2 First Name2            Address2         City2 State2  Contract3 First Name3            Address3            City3 State3
0       1234      Pitt      622.0        Brad             466 7th Ave  Park Slope     NY    28974.0        Brad    1901 Vine Street  Philadelphia     PA        NaN         NaN                 NaN              NaN    NaN
1       2345    Rhodes      622.0       Dusty  1 Public Library Plaze    Stamford     CT    28974.0       Dusty  1001 Kings Highway   Cherry Hill     NJ    54122.0       Dusty  444 Amsterdamn Ave  Upper West Side     NY
2       4321      Ford    54122.0       Henry             93 Booth Dr      Nutley     NJ        NaN         NaN                 NaN           NaN    NaN        NaN         NaN                 NaN              NaN    NaN
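
Note that First Name is repeated per numbered group here because it is treated as a value column. A small variant (a sketch, not part of the code above) adds it to the group keys so it stays a single column, as in the other answers:

keys = ['AccountID', 'Last Name', 'First Name']
dfg = df.groupby([*keys, df.groupby(keys).cumcount() + 1]).first().unstack()
dfg.columns = [f'{i}{j}' for i, j in dfg.columns]
df_out = dfg.sort_index(axis=1, key=lambda x: x.str[-1]).reset_index()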

Upvotes: 7
