Reputation: 15

Add a column to a DataFrame that is 1 where a certain column is 0, otherwise the original value of that column

The Python code with which I am trying to achieve this result is:

df['column2'] = np.where(df['column1'] == 0, 1, df['column1'])
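For reference, a minimal self-contained version of that line. The sample values in `column1` are made up for illustration; only the column names come from the code above.

```python
import numpy as np
import pandas as pd

# hypothetical sample data; only the column names come from the question
df = pd.DataFrame({'column1': [3, 0, 7, 0]})

# where column1 is 0 use 1, otherwise keep the original value
df['column2'] = np.where(df['column1'] == 0, 1, df['column1'])
print(df)
```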

Upvotes: 0

Answers (4)

Trenton McKinney

Reputation: 62463

For the test data below, np.where is the fastest option.
You can also use pandas.Series.where (or pandas.DataFrame.where), which keeps the existing value where the condition is True and replaces it where the condition is False.
100 is used as the replacement value to make the update easier to see.

import pandas as pd

# test dataframe
df = pd.DataFrame({'a': [2, 4, 1, 0, 2, 2, 0, 8, 4, 0], 'b': [2, 4, 0, 9, 2, 0, 2, 8, 0, 3]})

# replace 0 with 100 or leave the same number based on the same column
df['0 → 100 on a if a'] = df.a.where(df.a != 0, 100)

# replace 0 with 100 or leave the same number based on a different column
df['0 → 100 on a if b'] = df.a.where(df.b != 0, 100)

# display(df)
   a  b  0 → 100 on a if a  0 → 100 on a if b
0  2  2                  2                  2
1  4  4                  4                  4
2  1  0                  1                100
3  0  9                100                  0
4  2  2                  2                  2
5  2  0                  2                100
6  0  2                100                  0
7  8  8                  8                  8
8  4  0                  4                100
9  0  3                100                  0
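
As a sketch of the mirror-image method, `Series.mask` replaces values where the condition is True, so the condition does not have to be inverted. The variable names below are made up for illustration; the data is the same test dataframe as above.

```python
import pandas as pd

# same test dataframe as above
df = pd.DataFrame({'a': [2, 4, 1, 0, 2, 2, 0, 8, 4, 0], 'b': [2, 4, 0, 9, 2, 0, 2, 8, 0, 3]})

# mask replaces values where the condition is True
masked_a = df.a.mask(df.a == 0, 100)
masked_b = df.a.mask(df.b == 0, 100)

# compare against the where version, which uses the inverted condition
same_a = masked_a.equals(df.a.where(df.a != 0, 100))
same_b = masked_b.equals(df.a.where(df.b != 0, 100))
print(same_a, same_b)
```

Which of the two reads better is mostly a matter of whether the condition is more natural to state as "keep when" or "replace when".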

`%%timeit` testing

Test Data

import pandas as pd
import numpy as np

# test dataframe with 1M rows
np.random.seed(365)
df = pd.DataFrame({'a': np.random.randint(0, 10, size=(1000000)), 'b': np.random.randint(0, 10, size=(1000000))})

Tests

%%timeit
np.where(df.a == 0, 1, df.a)
[out]:
161 µs ± 1.47 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%%timeit
np.where(df.b == 0, 1, df.a)
[out]:
164 µs ± 1.94 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%%timeit
df.a.where(df.a != 0, 1)
[out]:
4.51 ms ± 136 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
df.a.where(df.b != 0, 1)
[out]:
4.55 ms ± 200 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
noah1(df)
[out]:
4.63 ms ± 58.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
noah2(df)
[out]:
15.3 s ± 205 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
paul(df)
[out]:
341 ms ± 34 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
karam(df)
[out]:
299 ms ± 4.68 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Functions

def noah1(d):
    return d.a.replace(0, 1)

def noah2(d):
    return d.apply(lambda x: 1 if x.a == 0 else x.b, axis=1)

def paul(d):
    return [1 if v==0 else v for v in d.a.values]

def karam(d):
    return d.a.apply(lambda x: 1 if x == 0 else x)

Upvotes: 4

noah

Reputation: 2786

What you want is essentially to just copy the column and replace 0s with 1s:

df["Column2"] = df["Column1"].replace(0,1)

More generally, if you want the value from some other ColumnX, you can use the following lambda function:

df["Column2"] = df.apply(lambda x: 1 if x["Column1"]==0 else x['ColumnX'], axis=1)
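
If the row-wise apply turns out to be slow on a large frame, the same logic can probably be written with np.where, as in the question. A minimal sketch, assuming made-up sample values and a numeric ColumnX:

```python
import numpy as np
import pandas as pd

# hypothetical sample data for illustration
df = pd.DataFrame({'Column1': [0, 5, 0, 2], 'ColumnX': [10, 20, 30, 40]})

# row-wise version from above, kept for comparison
row_wise = df.apply(lambda x: 1 if x['Column1'] == 0 else x['ColumnX'], axis=1)

# vectorized version: 1 where Column1 is 0, otherwise the value from ColumnX
df['Column2'] = np.where(df['Column1'] == 0, 1, df['ColumnX'])
print(df)
```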

Upvotes: 1

Paul Wilson

Reputation: 560

The apply example provided above should work, or this works too:

df['column_2'] = [1 if v==0 else v for v in df['col'].values]

My example uses list comprehension: https://www.w3schools.com/python/python_lists_comprehension.asp

And the other answer uses lambda function: https://www.w3schools.com/python/python_lambda.asp

Personally, when writing scripts that others may use, I think list comprehensions are more widely known and therefore more readable. However, I believe a lambda function performs faster and is in general a highly useful tool, so I would probably recommend it over a list comprehension.
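
If you want to check the performance belief on your own data, the two approaches can be timed directly. This is a rough sketch using the standard-library timeit module; the frame size, the helper function names, and the repeat count are arbitrary assumptions, and the numbers will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# hypothetical test data: 100,000 random integers between 0 and 9
rng = np.random.default_rng(0)
df = pd.DataFrame({'col': rng.integers(0, 10, size=100_000)})

def with_list_comprehension(d):
    return [1 if v == 0 else v for v in d['col'].values]

def with_apply_lambda(d):
    return d['col'].apply(lambda x: 1 if x == 0 else x)

# both approaches should produce identical values
results_match = list(with_apply_lambda(df)) == with_list_comprehension(df)

# total time for 5 calls of each approach
t_list = timeit.timeit(lambda: with_list_comprehension(df), number=5)
t_apply = timeit.timeit(lambda: with_apply_lambda(df), number=5)
print(f"list comprehension: {t_list:.3f} s, apply with lambda: {t_apply:.3f} s")
```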

Upvotes: 2

Karan Shishoo

Reputation: 2802

You should be able to achieve that using an apply statement in this manner:

df['column2'] = df['column1'].apply(lambda x: 1 if x == 0 else x)