Cleb
Cleb

Reputation: 25997

How to simultanously add several columns to a dataframe in Pandas?

I have a dataframe like this (usually with far more columns and rows):

    A   B
0   5  10
1  15   3

Now I want to add columns A_ratio and B_ratio to this dataframe whereby the values there represent A/(A + B) and B/(A + B), respectively. So A_ratio and B_ratio should add up to 1 in each row of the dataframe.

My first attempt looked like this:

import pandas as pd

df = pd.DataFrame({'A': [5,15], 'B': [10,3]})

for coli in df:
    df[coli + '_ratio'] = df[coli]/df.sum(axis=1)

giving me the following result:

    A   B   A_ratio   B_ratio
0   5  10  0.333333  0.652174
1  15   3  0.833333  0.159292

Clearly, the columns A_ratio and B_ratio do not add up to 1. While the values in A_ratio are correct they are wrong in B_ratio since the row sum is changed when A_ratio is added.

A workaround could be to copy the dataframe first:

df2 = pd.DataFrame({'A': [5,15], 'B': [10,3]})
df2cl = df2.copy()

for coli in df2:
    df2[coli + '_ratio'] = df2[coli]/df2cl.sum(axis=1)

which gives me the desired output:

    A   B   A_ratio   B_ratio
0   5  10  0.333333  0.666667
1  15   3  0.833333  0.166667

Is there a more efficient way of doing this which avoids copying the dataframe?

Upvotes: 3

Views: 194

Answers (2)

Chris
Chris

Reputation: 967

You don't need to call sum each time.

 >>%timeit %run multiple_sum.py
   100 loops, best of 3: 6.59 ms per loop

 >>%timeit %run single_sum.py
   100 loops, best of 3: 3.84 ms per loop

if you have a big dataframe this is going to be needless overhead.

sums = df.sum(axis=1)

for coli in df:
    df[coli + '_ratio'] = df[coli]/sums

is sufficient

Upvotes: 2

EdChum
EdChum

Reputation: 393903

You can just sub-select from your df so that it only sums those 2 columns:

In [195]:
for coli in df:
    df[coli + '_ratio'] = df[coli]/df[['A','B']].sum(axis=1)
df

Out[195]:
    A   B   A_ratio   B_ratio
0   5  10  0.333333  0.666667
1  15   3  0.833333  0.166667

You can just take a copy of the column names upfront if you don't want to hardcode them:

In [197]:
cols = df.columns
for coli in df:
    df[coli + '_ratio'] = df[coli]/df[cols].sum(axis=1)
df

Out[197]:
    A   B   A_ratio   B_ratio
0   5  10  0.333333  0.666667
1  15   3  0.833333  0.166667

Upvotes: 2

Related Questions