bebbi
bebbi

Reputation: 2539

Increase a column value of a duplicate row depending on its index in the list of duplicated rows

Starting with this dataframe of train trip segments:

df=pd.DataFrame({ 'Name': ['Susie', 'Susie', 'Frank', 'Tony', 'Tony'],
  'Trip Id': [1, 1, 2, 3, 3], 'From': ['London', 'Paris', 'Lyon', 'Munich', 'Prague'],
  'To': ['Paris', 'Berlin', 'Milan', 'Prague', 'Vienna'],
  'Passenger Count': [1, 1, 2, 4, 4]})
Name     Trip Id    From      To        Passenger Count
Susie    1          London    Paris     1
Susie    1          Paris     Berlin    1
Frank    2          Lyon      Milan     2
Tony     3          Munich    Prague    4
Tony     3          Prague    Vienna    4

(Note: a trip is a number of associated segments that forms 1 travel activity, think changing trains.)

I need to expand and remove the passenger counts to achieve a one-row-per-person-segment dataframe. Each anonymous segment should list the reference passenger. Every traveler needs their own Trip Id. The result should look like this:

Name     Trip Id    From      To        Named Passenger
Susie    1          London    Paris     NaN
Susie    1          Paris     Berlin    NaN
Frank    2          Lyon      Milan     NaN
NaN      4          Lyon      Milan     Frank
Tony     3          Munich    Prague    NaN
Tony     3          Prague    Vienna    NaN
NaN      5          Munich    Prague    Tony
NaN      5          Prague    Vienna    Tony
NaN      6          Munich    Prague    Tony
NaN      6          Prague    Vienna    Tony
NaN      7          Munich    Prague    Tony
NaN      7          Prague    Vienna    Tony

I almost achieved this, but am struggling with getting every person to have their own trip id.

I first managed to expand the passengers like this:

# First, setting the reference name for all records
df['Named Passenger'] = df.apply(lambda r: r['Name'], axis=1)
# Creating an expansion index.
new_index = df.index.repeat(df['Passenger Count'])
# Expanding the df
expanded = df.loc[new_index]
# Removing again the reference name for the original rows
expanded.loc[~new_index.duplicated(), 'Named Passenger'] = np.nan
# And removing the Name on duplicated rows (>1 personal info columns in reality)
expanded.loc[new_index.duplicated(), 'Name'] = np.nan
expanded = expanded.reset_index(drop=True)
expanded.drop(columns=['Passenger Count'], inplace=True)

expanded now looks like this:

     Name  Trip Id    From      To Named Passenger
0   Susie        1  London   Paris             NaN
1   Susie        1   Paris  Berlin             NaN
2   Frank        2    Lyon   Milan             NaN
3     NaN        2    Lyon   Milan           Frank
4    Tony        3  Munich  Prague             NaN
5     NaN        3  Munich  Prague            Tony
6     NaN        3  Munich  Prague            Tony
7     NaN        3  Munich  Prague            Tony
8    Tony        3  Prague  Vienna             NaN
9     NaN        3  Prague  Vienna            Tony
10    NaN        3  Prague  Vienna            Tony
11    NaN        3  Prague  Vienna            Tony

..but I have no idea how to correctly update the Trip Id now? (It doesn't matter what it is, as long as it's unique per passenger.)

Upvotes: 1

Views: 64

Answers (1)

Andreas
Andreas

Reputation: 9197

You could combine range and explode. Does this work for you?

import pandas as pd
import numpy as np
df=pd.DataFrame({ 'Name': ['Susie', 'Susie', 'Frank', 'Tony', 'Tony'],
  'Trip Id': [1, 1, 2, 3, 3], 'From': ['London', 'Paris', 'Lyon', 'Munich', 'Prague'],
  'To': ['Paris', 'Berlin', 'Milan', 'Prague', 'Vienna'],
  'Passenger Count': [1, 1, 2, 4, 4]})

# Expand rows
df['Named Passenger'] = df['Name'].copy()
df['Passenger Id'] = df['Passenger Count'].map(lambda x: list(range(1, x+1)))
df = df.explode('Passenger Id')
df['Trip Id Unique'] = df['Trip Id'].astype(str) + "_" + df['Passenger Id'].astype(str)

# Remove names
df['Name'] = np.where(~df.index.duplicated(keep='first'), df['Name'], np.nan)
df['Named Passenger'] = np.where(df['Name'].isnull(), df['Named Passenger'], np.nan)

# Encode Trip ID with unique numerical ID
df['Trip Id Unique'] = pd.Categorical(df['Trip Id Unique'])
df['Trip Id Unique'] = df['Trip Id Unique'].cat.codes +1

# Print
df[['Name', 'Trip Id Unique', 'From', 'To', 'Named Passenger']].sort_values(by=['Trip Id Unique', 'From'])

#     Name  Trip Id Unique    From      To Named Passenger
# 0  Susie               1  London   Paris             NaN
# 1  Susie               1   Paris  Berlin             NaN
# 2  Frank               2    Lyon   Milan             NaN
# 2    NaN               3    Lyon   Milan           Frank
# 3   Tony               4  Munich  Prague             NaN
# 4   Tony               4  Prague  Vienna             NaN
# 3    NaN               5  Munich  Prague            Tony
# 4    NaN               5  Prague  Vienna            Tony
# 3    NaN               6  Munich  Prague            Tony
# 4    NaN               6  Prague  Vienna            Tony
# 3    NaN               7  Munich  Prague            Tony
# 4    NaN               7  Prague  Vienna            Tony

Upvotes: 1

Related Questions