beta
beta

Reputation: 5666

pandas dataframe masks to write values into new column

Based on this solution, I created several masks on a Pandas dataframe to create a new column which should be filled from different columns (based on the conditions).

The (simplified) code looks as follows:

mask0 = (df['condition'] == 1)
mask1 = (df['condition'] == 0)

df.loc[mask0, 'newColumn'] = df['otherColumn1']
df.loc[mask1, 'newColumn'] = df['otherColumn2']

However upon execution of the third line I get the following error:

ValueError: cannot reindex from a duplicate axis

If I simply do this, it works:

df.loc[mask0, 'newColumn'] = 1

What am I doing wrong?

Upvotes: 9

Views: 7866

Answers (3)

willeM_ Van Onsem
willeM_ Van Onsem

Reputation: 476557

You need to mask the "data provider" as well:

df.loc[mask0, 'newColumn'] = df[mask0, 'otherColumn1']
df.loc[mask1, 'newColumn'] = df[mask1, 'otherColumn2']

If the first condition is true if the latter is false and vice versa, we can use np.where(..):

df['newColumn'] = np.where(mask0, df['otherColumn0'], df['otherColumn2'])

or you can use np.select(..) in case the two can be both false where we fallback on the old values in case both conditons are False:

df['newColumn'] = np.select(
    [mask0, mask1],
    [df['otherColumn1'], df['otherColumn2']],
    default=df['newColumn']
)

Here of course we make the assumption that newColumn already exists in the dataframe (for example through some previous processing).

Upvotes: 3

U13-Forward
U13-Forward

Reputation: 71560

Another solution with np.where:

df['newColumn'] = np.where(df['condition'].eq(1), df['otherColumn1'], df['condition'])
df['newColumn'] = np.where(df['condition'].eq(0), df['otherColumn2'], df['condition'])

Upvotes: 2

jezrael
jezrael

Reputation: 862431

You have to filter in both sides:

mask0 = (df['condition'] == 1)
mask1 = (df['condition'] == 0)

df.loc[mask0, 'newColumn'] = df.loc[mask0, 'otherColumn1']
df.loc[mask1, 'newColumn'] = df.loc[mask1, 'otherColumn2']

But here is better use numpy.select for avoid repeating code:

df['newColumn'] = np.select([mask0, mask1], 
                            [df['otherColumn1'], df['otherColumn2']], 
                            default=np.nan)

Upvotes: 17

Related Questions