Reputation: 17
I have a sample pandas dataframe (df) that looks like:
Thomas McMann 30 0 0 1172
Johnny Chien 25 0 1 1172
Robert McDoe 68 1 0 1172
Jason Singh 104 1 0 1172
Sarah Jessica 13 0 0 1172
My goal is to duplicate the columns where 'Flag' or 'Flag2' are equal to '1', and then change the assigned_code in the duplicated column as per follows:
if 'Flag' = 1, 'assigned_code' = 1105
if 'Flag2' = 1, 'assigned_code' = 1050
The resulting output would look as follows:
Name Age Flag Flag2 assigned_code
Thomas McMann 30 0 0 1172
Johnny Chien 25 0 1 1172
Johnny Chien 25 0 1 1050
Robert McDoe 68 1 0 1172
Robert McDoe 68 1 0 1105
Jason Singh 104 1 0 1172
Jason Singh 104 1 0 1105
Sarah Jessica 13 0 0 1172
Any help would be greatly appreciated! I am new to python and have been looking through a bunch of previously answered questions but I can't seem to find a solution that would work for my dataset.
Upvotes: 0
Views: 54
Reputation: 35636
We can use index.repeat
+ loc
to scale up the DataFrame then reset_index
. Then we can use dot
to modify the assigned_code
based on where index was duplicated
and the Flags:
idx = df.index.repeat((df['Flag'] | df['Flag2']) + 1)
df = df.loc[idx].reset_index(drop=True)
m = idx.duplicated()
df.loc[m, 'assigned_code'] = df.loc[m, ['Flag', 'Flag2']].dot([1105, 1050])
df
:
Name Age Flag Flag2 assigned_code
0 Thomas McMann 30 0 0 1172
1 Johnny Chien 25 0 1 1172
2 Johnny Chien 25 0 1 1050
3 Robert McDoe 68 1 0 1172
4 Robert McDoe 68 1 0 1105
5 Jason Singh 104 1 0 1172
6 Jason Singh 104 1 0 1105
7 Sarah Jessica 13 0 0 1172
Explanation:
DataFrame and imports.
import pandas as pd
df = pd.DataFrame({
'Name': ['Thomas McMann', 'Johnny Chien', 'Robert McDoe', 'Jason Singh',
'Sarah Jessica'],
'Age': [30, 25, 68, 104, 13],
'Flag': [0, 0, 1, 1, 0],
'Flag2': [0, 1, 0, 0, 0],
'assigned_code': [1172, 1172, 1172, 1172, 1172]
})
We determine which rows are repeated by logical OR Flag and Flag2:
df['Flag'] | df['Flag2']
0 0
1 1
2 1
3 1
4 0
dtype: int64
Then we add 1 to make 0 repeated 1 time and 1 repeated 2 times:
(df['Flag'] | df['Flag2']) + 1
0 1
1 2
2 2
3 2
4 1
dtype: int64
We then select repeat the index and select from the DataFrame with loc
:
df.index.repeat((df['Flag'] | df['Flag2']) + 1)
Int64Index([0, 1, 1, 2, 2, 3, 3, 4], dtype='int64')
df.loc[df.index.repeat((df['Flag'] | df['Flag2']) + 1)]
Name Age Flag Flag2 assigned_code
0 Thomas McMann 30 0 0 1172
1 Johnny Chien 25 0 1 1172
1 Johnny Chien 25 0 1 1172
2 Robert McDoe 68 1 0 1172
2 Robert McDoe 68 1 0 1172
3 Jason Singh 104 1 0 1172
3 Jason Singh 104 1 0 1172
4 Sarah Jessica 13 0 0 1172
reset_index
is used to make the index unique again.
Value Replacement:
index.duplicated
will indicate where there are new duplicated rows:
m = idx.duplicated()
[False False True False True False True False]
This allows just selecting the new rows:
df.loc[m, ['Flag', 'Flag2']]
Flag Flag2
2 0 1
4 1 0
6 1 0
The dot
product works to get the correct mapping assuming Flag and Flag2 will always be 1 or the other:
df.loc[m, ['Flag', 'Flag2']].dot([1105, 1050])
2 1050
4 1105
6 1105
dtype: int64
Alternatively we could use 2 loc
assignments:
df.loc[m & df['Flag'].eq(1), 'assigned_code'] = 1105
df.loc[m & df['Flag2'].eq(1), 'assigned_code'] = 1050
Upvotes: 2