Reputation: 33
In this example I have 7 columns per row. I group by AccountID and Last Name, which together identify the same person; each additional row's Contract, Address, City, and State values represent a new location for that AccountID/Last Name.
I would like AccountID/Last Name on one line, followed by one or more sets of Contract, Address, City, and State.
Current data looks like this:
Contract | AccountID | Last Name | First Name | Address | City | State |
---|---|---|---|---|---|---|
622 | 1234 | Pitt | Brad | 466 7th Ave | Park Slope | NY |
28974 | 1234 | Pitt | Brad | 1901 Vine Street | Philadelphia | PA |
54122 | 4321 | Ford | Henry | 93 Booth Dr | Nutley | NJ |
622 | 2345 | Rhodes | Dusty | 1 Public Library Plaze | Stamford | CT |
28974 | 2345 | Rhodes | Dusty | 1001 Kings Highway | Cherry Hill | NJ |
54122 | 2345 | Rhodes | Dusty | 444 Amsterdamn Ave | Upper West Side | NY |
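For anyone reproducing this without the clipboard, the sample above can be built directly as a DataFrame (same column names and values as the table, including the original spellings):

```python
import pandas as pd

# Sample frame matching the "current data" table above
df = pd.DataFrame({
    'Contract': [622, 28974, 54122, 622, 28974, 54122],
    'AccountID': [1234, 1234, 4321, 2345, 2345, 2345],
    'Last Name': ['Pitt', 'Pitt', 'Ford', 'Rhodes', 'Rhodes', 'Rhodes'],
    'First Name': ['Brad', 'Brad', 'Henry', 'Dusty', 'Dusty', 'Dusty'],
    'Address': ['466 7th Ave', '1901 Vine Street', '93 Booth Dr',
                '1 Public Library Plaze', '1001 Kings Highway', '444 Amsterdamn Ave'],
    'City': ['Park Slope', 'Philadelphia', 'Nutley',
             'Stamford', 'Cherry Hill', 'Upper West Side'],
    'State': ['NY', 'PA', 'NJ', 'CT', 'NJ', 'NY'],
})
```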
Would like to display the data like this:
AccountID | Last Name | First Name | Contract_1 | Address_1 | City_1 | State_1 | Contract_2 | Address_2 | City_2 | State_2 | Contract_3 | Address_3 | City_3 | State_3 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1234 | Pitt | Brad | 622 | 466 7th Ave | Park Slope | NY | 28974.0 | 1901 Vine Street | Philadelphia | PA | ||||
4321 | Ford | Henry | 54122 | 93 Booth Dr | Nutley | NJ | ||||||||
2345 | Rhodes | Dusty | 622 | 1 Public Library Plaze | Stamford | CT | 28974.0 | 1001 Kings Highway | Cherry Hill | NJ | 54122.0 | 444 Amsterdamn Ave | Upper West Side | NY |
Here is what I've done so far. Steps 5 and on I have been reworking for a week. Any suggestions?
# Step 1
import pandas as pd
import numpy as np
# read from "my clipboard"
df = pd.read_clipboard()
df
# Step 2 - combine the four location fields, then split on the delimiter
df['Contract_State'] = (df['Contract'].astype(str) + '|' + df['Address'] + '|' + df['City'] + '|' + df['State']).str.split('|')
df['Contract'] = df['Contract'].astype(str)
df['AccountID'] = df['AccountID'].astype(str)
# Step 3 - groupby
df2 = pd.DataFrame(df.groupby(['AccountID', 'Last Name']).Contract_State.apply(list)).reset_index()
df2
# Step 4 - flatten the lists
df2['Contract_State'] = df2['Contract_State'].apply(lambda x: np.array(x).flatten())
df2
# Step 5 - each list holds 4 fields (Contract, Address, City, State) per location,
# so the longest list determines how many columns we need
num_columns = df2['Contract_State'].apply(len).max()
num_columns
# Step 6 - build matching column labels, then expand the lists into columns
columns = [f'{name}_{i}' for i in range(1, num_columns // 4 + 1)
           for name in ['Contract', 'Address', 'City', 'State']]
df3 = pd.DataFrame(df2['Contract_State'].tolist(), columns=columns)
df3
# Step 7 - concatenate df2 with contracts, then drop the column "Contract_State"
df4 = pd.concat([df2, df3], join='inner', axis='columns').drop('Contract_State', axis='columns')
df4
Upvotes: 3
Views: 1786
Reputation: 35636
We can pass a Series directly to pivot_table's columns with aggfunc='first'. Use groupby + cumcount to enumerate the rows within each group; those counts become the new column suffixes:
cols = ['AccountID', 'Last Name', 'First Name']
dfp = (
    df.pivot_table(
        index=cols,
        columns=df.groupby(cols).cumcount() + 1,
        aggfunc='first'
    ).sort_index(axis=1, level=1, sort_remaining=False)
)

# Collapse MultiIndex
dfp.columns = dfp.columns.map(lambda t: '_'.join(map(str, t)))
dfp = dfp.reset_index()
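To see what the enumeration looks like, here is a minimal sketch of cumcount on just the grouping keys from the sample data (1 marks each group's first row, 2 its second, and so on):

```python
import pandas as pd

# Grouping keys only, in the same row order as the sample data
df = pd.DataFrame({
    'AccountID': [1234, 1234, 4321, 2345, 2345, 2345],
    'Last Name': ['Pitt', 'Pitt', 'Ford', 'Rhodes', 'Rhodes', 'Rhodes'],
    'First Name': ['Brad', 'Brad', 'Henry', 'Dusty', 'Dusty', 'Dusty'],
})
cols = ['AccountID', 'Last Name', 'First Name']

# cumcount numbers rows within each group starting at 0; +1 gives the suffixes
suffix = df.groupby(cols).cumcount() + 1
print(suffix.tolist())  # [1, 2, 1, 1, 2, 3]
```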
Or with set_index + unstack, skipping the 'first' aggregation entirely, since cumcount ensures each index entry is unique:
cols = ['AccountID', 'Last Name', 'First Name']
dfw = df.set_index(
    [*cols, df.groupby(cols).cumcount() + 1]
).unstack().sort_index(axis=1, level=1, sort_remaining=False)

# Collapse MultiIndex
dfw.columns = dfw.columns.map(lambda t: '_'.join(map(str, t)))
dfw = dfw.reset_index()
Either option produces:
AccountID Last Name First Name Address_1 City_1 Contract_1 State_1 Address_2 City_2 Contract_2 State_2 Address_3 City_3 Contract_3 State_3
0 1234 Pitt Brad 466 7th Ave Park Slope 622.0 NY 1901 Vine Street Philadelphia 28974.0 PA NaN NaN NaN NaN
1 2345 Rhodes Dusty 1 Public Library Plaze Stamford 622.0 CT 1001 Kings Highway Cherry Hill 28974.0 NJ 444 Amsterdamn Ave Upper West Side 54122.0 NY
2 4321 Ford Henry 93 Booth Dr Nutley 54122.0 NJ NaN NaN NaN NaN NaN NaN NaN NaN
The pyjanitor module has an abstraction for this operation called pivot_wider, which hides away both the collapsing of the MultiIndex and the restoration of the index columns:
# pip install pyjanitor
# conda install pyjanitor -c conda-forge
import janitor
import pandas as pd

cols = ['AccountID', 'Last Name', 'First Name']
dfw = (
    df.add_column(
        'group', df.groupby(cols).cumcount() + 1
    ).pivot_wider(
        index=cols,
        names_from='group'
    )
)
AccountID Last Name First Name Contract_1 Contract_2 Contract_3 Address_1 Address_2 Address_3 City_1 City_2 City_3 State_1 State_2 State_3
0 1234 Pitt Brad 622.0 28974.0 NaN 466 7th Ave 1901 Vine Street NaN Park Slope Philadelphia NaN NY PA NaN
1 2345 Rhodes Dusty 622.0 28974.0 54122.0 1 Public Library Plaze 1001 Kings Highway 444 Amsterdamn Ave Stamford Cherry Hill Upper West Side CT NJ NY
2 4321 Ford Henry 54122.0 NaN NaN 93 Booth Dr NaN NaN Nutley NaN NaN NJ NaN NaN
There is also an abstraction for collapsing the MultiIndex, janitor.collapse_levels, which can be chained with the pandas operations for a cleaner look without giving up the flexibility offered by pivot_table and sort_index:
cols = ['AccountID', 'Last Name', 'First Name']
dfp = (
    df.pivot_table(
        index=cols,
        columns=df.groupby(cols).cumcount() + 1,
        aggfunc='first'
    ).sort_index(
        axis=1, level=1, sort_remaining=False
    ).collapse_levels(sep='_').reset_index()
)
dfp
AccountID Last Name First Name Address_1 City_1 Contract_1 State_1 Address_2 City_2 Contract_2 State_2 Address_3 City_3 Contract_3 State_3
0 1234 Pitt Brad 466 7th Ave Park Slope 622.0 NY 1901 Vine Street Philadelphia 28974.0 PA NaN NaN NaN NaN
1 2345 Rhodes Dusty 1 Public Library Plaze Stamford 622.0 CT 1001 Kings Highway Cherry Hill 28974.0 NJ 444 Amsterdamn Ave Upper West Side 54122.0 NY
2 4321 Ford Henry 93 Booth Dr Nutley 54122.0 NJ NaN NaN NaN NaN NaN NaN NaN NaN
Upvotes: 5
Reputation: 13349
You can try using groupby and unstack:
grp_col = ['AccountID', 'Last Name', 'First Name']
df['num'] = df.groupby(grp_col).cumcount()+1
res = df.set_index([*grp_col, 'num']).unstack('num').sort_index(axis=1, level=1).reset_index()
res.columns = res.columns.map(lambda x: f"{x[0]}{x[1]}")
res:
AccountID | Last Name | First Name | Contract1 | Address1 | City1 | State1 | Contract2 | Address2 | City2 | State2 | Contract3 | Address3 | City3 | State3 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1234 | Pitt | Brad | 622.0 | 466 7th Ave | Park Slope | NY | 28974.0 | 1901 Vine Street | Philadelphia | PA | NaN | NaN | NaN | NaN |
1 | 2345 | Rhodes | Dusty | 622.0 | 1 Public Library Plaze | Stamford | CT | 28974.0 | 1001 Kings Highway | Cherry Hill | NJ | 54122.0 | 444 Amsterdamn Ave | Upper West Side | NY |
2 | 4321 | Ford | Henry | 54122.0 | 93 Booth Dr | Nutley | NJ | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Upvotes: 4
Reputation: 23146
Try with groupby and pivot_table:
df["group"] = df.groupby(["AccountID", "Last Name", "First Name"]).cumcount() + 1
output = df.pivot_table(index=["AccountID", "Last Name", "First Name"],
                        columns='group',
                        values=['Contract', 'Address', 'City', 'State'],
                        aggfunc='first')
output = output.sort_index(axis=1, level=1)
output.columns = [f"{i}_{j}" for i, j in output.columns]
output = output.reset_index()
Upvotes: 4
Reputation: 153460
IIUC, I think you can do it like this:
dfg = df.groupby(
    ['AccountID', 'Last Name', df.groupby(['AccountID', 'Last Name']).cumcount() + 1]
).first().unstack()
dfg.columns = [f'{i}{j}' for i, j in dfg.columns]
df_out = dfg.sort_index(axis=1, key=lambda x: x.str[-1])
df_out.reset_index()
Output:
AccountID Last Name Contract1 First Name1 Address1 City1 State1 Contract2 First Name2 Address2 City2 State2 Contract3 First Name3 Address3 City3 State3
0 1234 Pitt 622.0 Brad 466 7th Ave Park Slope NY 28974.0 Brad 1901 Vine Street Philadelphia PA NaN NaN NaN NaN NaN
1 2345 Rhodes 622.0 Dusty 1 Public Library Plaze Stamford CT 28974.0 Dusty 1001 Kings Highway Cherry Hill NJ 54122.0 Dusty 444 Amsterdamn Ave Upper West Side NY
2 4321 Ford 54122.0 Henry 93 Booth Dr Nutley NJ NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Upvotes: 7