Reputation: 2539
Starting with this dataframe of train trip segments:
df=pd.DataFrame({ 'Name': ['Susie', 'Susie', 'Frank', 'Tony', 'Tony'],
'Trip Id': [1, 1, 2, 3, 3], 'From': ['London', 'Paris', 'Lyon', 'Munich', 'Prague'],
'To': ['Paris', 'Berlin', 'Milan', 'Prague', 'Vienna'],
'Passenger Count': [1, 1, 2, 4, 4]})
Name Trip Id From To Passenger Count
Susie 1 London Paris 1
Susie 1 Paris Berlin 1
Frank 2 Lyon Milan 2
Tony 3 Munich Prague 4
Tony 3 Prague Vienna 4
(Note: a trip is a number of associated segments that forms 1 travel activity, think changing trains.)
I need to expand and remove the passenger counts to achieve a one-row-per-person-segment dataframe.
Each anonymous segment should list the reference passenger. Every traveler needs their own Trip Id
.
The result should look like this:
Name Trip Id From To Named Passenger
Susie 1 London Paris NaN
Susie 1 Paris Berlin NaN
Frank 2 Lyon Milan NaN
NaN 4 Lyon Milan Frank
Tony 3 Munich Prague NaN
Tony 3 Prague Vienna NaN
NaN 5 Munich Prague Tony
NaN 5 Prague Vienna Tony
NaN 6 Munich Prague Tony
NaN 6 Prague Vienna Tony
NaN 7 Munich Prague Tony
NaN 7 Prague Vienna Tony
I almost achieved this, but am struggling with getting every person to have their own trip id
.
I first managed to expand the passengers like this:
# First, setting the reference name for all records
df['Named Passenger'] = df.apply(lambda r: r['Name'], axis=1)
# Creating an expansion index.
new_index = df.index.repeat(df['Passenger Count'])
# Expanding the df
expanded = df.loc[new_index]
# Removing again the reference name for the original rows
expanded.loc[~new_index.duplicated(), 'Named Passenger'] = np.nan
# And removing the Name on duplicated rows (>1 personal info columns in reality)
expanded.loc[new_index.duplicated(), 'Name'] = np.nan
expanded = expanded.reset_index(drop=True)
expanded.drop(columns=['Passenger Count'], inplace=True)
expanded
now looks like this:
Name Trip Id From To Named Passenger
0 Susie 1 London Paris NaN
1 Susie 1 Paris Berlin NaN
2 Frank 2 Lyon Milan NaN
3 NaN 2 Lyon Milan Frank
4 Tony 3 Munich Prague NaN
5 NaN 3 Munich Prague Tony
6 NaN 3 Munich Prague Tony
7 NaN 3 Munich Prague Tony
8 Tony 3 Prague Vienna NaN
9 NaN 3 Prague Vienna Tony
10 NaN 3 Prague Vienna Tony
11 NaN 3 Prague Vienna Tony
..but I have no idea how to correctly update the Trip Id now? (It doesn't matter what it is, as long as it's unique per passenger.)
Upvotes: 1
Views: 64
Reputation: 9197
You could combine range and explode. Does this work for you?
import pandas as pd
import numpy as np
df=pd.DataFrame({ 'Name': ['Susie', 'Susie', 'Frank', 'Tony', 'Tony'],
'Trip Id': [1, 1, 2, 3, 3], 'From': ['London', 'Paris', 'Lyon', 'Munich', 'Prague'],
'To': ['Paris', 'Berlin', 'Milan', 'Prague', 'Vienna'],
'Passenger Count': [1, 1, 2, 4, 4]})
# Expand rows
df['Named Passenger'] = df['Name'].copy()
df['Passenger Id'] = df['Passenger Count'].map(lambda x: list(range(1, x+1)))
df = df.explode('Passenger Id')
df['Trip Id Unique'] = df['Trip Id'].astype(str) + "_" + df['Passenger Id'].astype(str)
# Remove names
df['Name'] = np.where(~df.index.duplicated(keep='first'), df['Name'], np.nan)
df['Named Passenger'] = np.where(df['Name'].isnull(), df['Named Passenger'], np.nan)
# Encode Trip ID with unique numerical ID
df['Trip Id Unique'] = pd.Categorical(df['Trip Id Unique'])
df['Trip Id Unique'] = df['Trip Id Unique'].cat.codes +1
# Print
df[['Name', 'Trip Id Unique', 'From', 'To', 'Named Passenger']].sort_values(by=['Trip Id Unique', 'From'])
# Name Trip Id Unique From To Named Passenger
# 0 Susie 1 London Paris NaN
# 1 Susie 1 Paris Berlin NaN
# 2 Frank 2 Lyon Milan NaN
# 2 NaN 3 Lyon Milan Frank
# 3 Tony 4 Munich Prague NaN
# 4 Tony 4 Prague Vienna NaN
# 3 NaN 5 Munich Prague Tony
# 4 NaN 5 Prague Vienna Tony
# 3 NaN 6 Munich Prague Tony
# 4 NaN 6 Prague Vienna Tony
# 3 NaN 7 Munich Prague Tony
# 4 NaN 7 Prague Vienna Tony
Upvotes: 1