Michael
Michael

Reputation: 8788

panda add several new columns based on values from other columns at the same time?

How to add several new columns based on values from other columns at the same time? I only found examples to add a row one at a time.

I am able to add 3 new columns but this does not seem efficient since it has to go through all the rows 3 times. Is there a way to traverse the DF once?

import pandas as pd
from decimal import Decimal
d = [
    {'A': 2, 'B': Decimal('628.00')},
    {'A': 1, 'B': Decimal('383.00')},
    {'A': 3, 'B': Decimal('651.00')},
    {'A': 2, 'B': Decimal('575.00')},
    {'A': 4, 'B': Decimal('1114.00')},
]

df = pd.DataFrame(d)

In : df
Out:
   A        B
0  2   628.00
1  1   383.00
2  3   651.00
3  2   575.00
4  4  1114.00

# How to do those in one operation to avoid traversing the DF 3 times
df['C'] = df.apply(lambda row: row['B']-1000, axis=1)
df['D'] = df.apply(lambda row: row['B']*row['B'], axis=1)
df['E'] = df.apply(lambda row: row['B']/2, axis=1)

In : df
Out:
   A        B        C             D       E
0  2   628.00  -372.00   394384.0000  314.00
1  1   383.00  -617.00   146689.0000  191.50
2  3   651.00  -349.00   423801.0000  325.50
3  2   575.00  -425.00   330625.0000  287.50
4  4  1114.00   114.00  1240996.0000  557.00

Upvotes: 1

Views: 107

Answers (1)

Alexander
Alexander

Reputation: 109686

I wouldn't use a lambda function. Simple vectorized implementation is both faster and easier to read.

df['C'] = df['B'] - 1000
df['D'] = df['B'] ** 2
df['E'] = df['B'] / 2

>>> df
   A        B        C             D       E
0  2   628.00  -372.00   394384.0000  314.00
1  1   383.00  -617.00   146689.0000  191.50
2  3   651.00  -349.00   423801.0000  325.50
3  2   575.00  -425.00   330625.0000  287.50
4  4  1114.00   114.00  1240996.0000  557.00

Let's time it on a dataframe with one million rows:

df = pd.concat([df for _ in range(200000)], ignore_index=True)
>>> df.shape
(1000000, 2)

>>> %%timeit -n 3
    df['C'] = df.apply(lambda row: row['B'] - 1000, axis=1)
    df['D'] = df.apply(lambda row: row['B'] * row['B'], axis=1)
    df['E'] = df.apply(lambda row: row['B'] / 2, axis=1)
3 loops, best of 3: 1min 20s per loop

>>> %%timeit -n 3
    df['C'] = df['B'] - 1000
    df['D'] = df['B'] ** 2
    df['E'] = df['B'] / 2
3 loops, best of 3: 49.7 s per loop

The speed is significantly faster if you did away with the Decimal type and used a float instead:

d = [
    {'A': 2, 'B': 628.00},
    {'A': 1, 'B': 383.00},
    {'A': 3, 'B': 651.00},
    {'A': 2, 'B': 575.00},
    {'A': 4, 'B': 1114.00}]

df = pd.DataFrame(d)
df = pd.concat([df for _ in range(200000)], ignore_index=True)

>>> %%timeit -n 3
    df['C'] = df['B'] - 1000
    df['D'] = df['B'] ** 2
    df['E'] = df['B'] / 2
3 loops, best of 3: 33.1 ms per loop

>>> df.shape
(1000000, 5)

Upvotes: 1

Related Questions