Pandas - Duplicate rows based on a condition and simultaneously change a column value

Question

I have a sample pandas dataframe (df) that looks like:

Thomas McMann   30     0      0          1172
 Johnny Chien   25     0      1          1172
 Robert McDoe   68     1      0          1172
  Jason Singh  104     1      0          1172
Sarah Jessica   13     0      0          1172

My goal is to duplicate the columns where 'Flag' or 'Flag2' are equal to '1', and then change the assigned_code in the duplicated column as per follows:

if 'Flag' = 1, 'assigned_code' = 1105
if 'Flag2' = 1, 'assigned_code' = 1050

The resulting output would look as follows:

          Name  Age  Flag  Flag2 assigned_code
Thomas McMann   30     0      0          1172
 Johnny Chien   25     0      1          1172
 Johnny Chien   25     0      1          1050
 Robert McDoe   68     1      0          1172
 Robert McDoe   68     1      0          1105
  Jason Singh  104     1      0          1172
  Jason Singh  104     1      0          1105
Sarah Jessica   13     0      0          1172

Any help would be greatly appreciated! I am new to python and have been looking through a bunch of previously answered questions but I can't seem to find a solution that would work for my dataset.

Henry Ecker · Accepted Answer

We can use index.repeat + loc to scale up the DataFrame then reset_index. Then we can use dot to modify the assigned_code based on where index was duplicated and the Flags:

idx = df.index.repeat((df['Flag'] | df['Flag2']) + 1)
df = df.loc[idx].reset_index(drop=True)

m = idx.duplicated()
df.loc[m, 'assigned_code'] = df.loc[m, ['Flag', 'Flag2']].dot([1105, 1050])

df:

            Name  Age  Flag  Flag2  assigned_code
0  Thomas McMann   30     0      0           1172
1   Johnny Chien   25     0      1           1172
2   Johnny Chien   25     0      1           1050
3   Robert McDoe   68     1      0           1172
4   Robert McDoe   68     1      0           1105
5    Jason Singh  104     1      0           1172
6    Jason Singh  104     1      0           1105
7  Sarah Jessica   13     0      0           1172

Explanation:

DataFrame and imports.

import pandas as pd

df = pd.DataFrame({
    'Name': ['Thomas McMann', 'Johnny Chien', 'Robert McDoe', 'Jason Singh',
             'Sarah Jessica'],
    'Age': [30, 25, 68, 104, 13],
    'Flag': [0, 0, 1, 1, 0],
    'Flag2': [0, 1, 0, 0, 0],
    'assigned_code': [1172, 1172, 1172, 1172, 1172]
})

We determine which rows are repeated by logical OR Flag and Flag2:

df['Flag'] | df['Flag2']

0    0
1    1
2    1
3    1
4    0
dtype: int64

Then we add 1 to make 0 repeated 1 time and 1 repeated 2 times:

(df['Flag'] | df['Flag2']) + 1

0    1
1    2
2    2
3    2
4    1
dtype: int64

We then select repeat the index and select from the DataFrame with loc:

df.index.repeat((df['Flag'] | df['Flag2']) + 1)
Int64Index([0, 1, 1, 2, 2, 3, 3, 4], dtype='int64')

df.loc[df.index.repeat((df['Flag'] | df['Flag2']) + 1)]

            Name  Age  Flag  Flag2  assigned_code
0  Thomas McMann   30     0      0           1172
1   Johnny Chien   25     0      1           1172
1   Johnny Chien   25     0      1           1172
2   Robert McDoe   68     1      0           1172
2   Robert McDoe   68     1      0           1172
3    Jason Singh  104     1      0           1172
3    Jason Singh  104     1      0           1172
4  Sarah Jessica   13     0      0           1172

reset_index is used to make the index unique again.

Value Replacement:

index.duplicated will indicate where there are new duplicated rows:

m = idx.duplicated()
[False False  True False  True False  True False]

This allows just selecting the new rows:

df.loc[m, ['Flag', 'Flag2']]

   Flag  Flag2
2     0      1
4     1      0
6     1      0

The dot product works to get the correct mapping assuming Flag and Flag2 will always be 1 or the other:

df.loc[m, ['Flag', 'Flag2']].dot([1105, 1050])

2    1050
4    1105
6    1105
dtype: int64

Alternatively we could use 2 loc assignments:

df.loc[m & df['Flag'].eq(1), 'assigned_code'] = 1105
df.loc[m & df['Flag2'].eq(1), 'assigned_code'] = 1050

Pandas - Duplicate rows based on a condition and simultaneously change a column value

Answers (1)

Related Questions