sariii

Reputation: 2150

How to sum over some columns based on condition in pandas

I have a data frame like this:

mydf = {'p1':[0.1, 0.2, 0.3], 'p2':[0.2, 0.1,0.3], 'p3':[0.1,0.9, 0.01], 'p4':[0.11, 0.2, 0.4], 'p5':[0.3, 0.1,0.5],
        'w1':['cancel','hello', 'hi'], 'w2':['good','bad','ugly'], 'w3':['thanks','CUSTOM_MASK','great'],
        'w4':['CUSTOM_MASK','CUSTOM_UNKNOWN', 'trible'],'w5':['CUSTOM_MASK','CUSTOM_MASK','job']}
df = pd.DataFrame(mydf)

What I need to do is sum the values in columns p1, p2, p3, p4, p5 wherever the corresponding value in w1, w2, w3, w4, w5 is not CUSTOM_MASK or CUSTOM_UNKNOWN.

So the result would be a new column added to the data frame like this (0.1+0.2+0.1=0.4 for the first row):

top_p
0.4
0.3
1.51

So my question is that is there any pandas way to do this?

What I have done so far is to loop through the rows and columns, check each value against CUSTOM_MASK and CUSTOM_UNKNOWN, and sum up the p-values where those strings are not present.
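For reference, the row-by-row loop described above might look something like the following minimal sketch (assuming the p1..p5 / w1..w5 column layout from the question; this is an illustration of the slow approach, not the asker's exact code):

```python
import pandas as pd

mydf = {'p1': [0.1, 0.2, 0.3], 'p2': [0.2, 0.1, 0.3], 'p3': [0.1, 0.9, 0.01],
        'p4': [0.11, 0.2, 0.4], 'p5': [0.3, 0.1, 0.5],
        'w1': ['cancel', 'hello', 'hi'], 'w2': ['good', 'bad', 'ugly'],
        'w3': ['thanks', 'CUSTOM_MASK', 'great'],
        'w4': ['CUSTOM_MASK', 'CUSTOM_UNKNOWN', 'trible'],
        'w5': ['CUSTOM_MASK', 'CUSTOM_MASK', 'job']}
df = pd.DataFrame(mydf)

excluded = {'CUSTOM_MASK', 'CUSTOM_UNKNOWN'}
totals = []
for _, row in df.iterrows():
    # sum p-values whose paired w-value is not an excluded token
    total = sum(row[f'p{i}'] for i in range(1, 6)
                if row[f'w{i}'] not in excluded)
    totals.append(total)
df['top_p'] = totals
print(df['top_p'].tolist())
```

This gives 0.4, 0.3, 1.51 (up to floating-point rounding), but iterating rows in Python is exactly what the vectorized answers below avoid.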

Upvotes: 0

Views: 2201

Answers (2)

constantstranger

Reputation: 9379

Here's a way to do this using np.dot():

pCols, wCols = ['p' + str(i + 1) for i in range(5)], ['w' + str(i + 1) for i in range(5)]
df['top_p'] = df.apply(lambda x: np.dot(x[pCols], ~(x[wCols].isin(['CUSTOM_MASK','CUSTOM_UNKNOWN']))), axis=1)

We first prepare the two sets of column names p1,...,p5 and w1,...,w5.

Then we use apply() to take the dot product of the values in the pN columns with a boolean filter derived from the wN columns: only pN values whose corresponding wN value is not in the list of excluded strings contribute to the sum.
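To see concretely why the dot product acts as a filter, here is a hand-worked check for row 0 (not part of the original answer; the boolean vector is what `~isin(...)` produces for that row, where w4 and w5 are masked):

```python
import numpy as np

p_row = np.array([0.1, 0.2, 0.1, 0.11, 0.3])
# True where the w-value is NOT an excluded token (row 0: w4, w5 masked out)
keep = np.array([True, True, True, False, False])

# Booleans act as 0/1 in the dot product, so masked p-values drop out:
# 0.1*1 + 0.2*1 + 0.1*1 + 0.11*0 + 0.3*0 = 0.4
print(np.dot(p_row, keep))
```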

Output:

    p1   p2    p3    p4   p5      w1    w2           w3              w4           w5  top_p
0  0.1  0.2  0.10  0.11  0.3  cancel  good       thanks     CUSTOM_MASK  CUSTOM_MASK   0.40
1  0.2  0.1  0.90  0.20  0.1   hello   bad  CUSTOM_MASK  CUSTOM_UNKNOWN  CUSTOM_MASK   0.30
2  0.3  0.3  0.01  0.40  0.5      hi  ugly        great          trible          job   1.51

Alternatively, element-wise multiplication and sum across columns can be used like this:

pCols, wCols = [[c for c in df.columns if c[0] == char] for char in 'pw']
colMap = {wCols[i]: pCols[i] for i in range(len(pCols))}
df['top_p'] = (df[pCols] * ~df[wCols].rename(columns=colMap).isin(['CUSTOM_MASK','CUSTOM_UNKNOWN'])).sum(axis=1)

Here, we need to rename the columns of one of the two 5-column DataFrames so that * (DataFrame.multiply()) aligns the columns and performs the element-wise multiplication.

UPDATE: Here are a few timing comparisons on various possible methods for solving this question:

#1. Pandas mask and sum (see answer by @enke):

df['top_p'] = df.filter(like='p').mask(df.filter(like='w').isin(['CUSTOM_MASK','CUSTOM_UNKNOWN']).to_numpy()).sum(axis=1)

#2. Pandas apply with Numpy dot solution:

pCols, wCols = ['p' + str(i + 1) for i in range(5)], ['w' + str(i + 1) for i in range(5)]
df['top_p'] = df.apply(lambda x: np.dot(x[pCols], ~(x[wCols].isin(['CUSTOM_MASK','CUSTOM_UNKNOWN']))), axis=1)

#3. Pandas element-wise multiply and sum:

pCols, wCols = [[c for c in df.columns if c[0] == char] for char in 'pw']
colMap = {wCols[i] : pCols[i] for i in range(len(pCols))}
df['top_p'] = (df[pCols] * ~df[wCols].rename(columns=colMap).isin(['CUSTOM_MASK','CUSTOM_UNKNOWN'])).sum(axis=1)

#4. Numpy element-wise multiply and sum:

pCols, wCols = [[c for c in df.columns if c[0] == char] for char in 'pw']
df['top_p'] = (df[pCols].to_numpy() * ~df[wCols].isin(['CUSTOM_MASK','CUSTOM_UNKNOWN']).to_numpy()).sum(axis=1)

Timing results:

Timeit results for df with 30000 rows:
method_1 ran in 0.008165133331203833 seconds using 3 iterations
method_2 ran in 13.408894366662329 seconds using 3 iterations
method_3 ran in 0.007688766665523872 seconds using 3 iterations
method_4 ran in 0.006326200003968552 seconds using 3 iterations

Time performance results: Method #4 (numpy multiply/sum) is about 20% faster than the runners-up. Methods #1 and #3 (pandas mask/sum vs multiply/sum) are neck-and-neck in second place. Method #2 (pandas apply/numpy dot) is frightfully slow.

Here's the timeit() test code in case it's of interest:

import pandas as pd
import numpy as np
nListReps = 10000
df = pd.DataFrame({'p1':[0.1, 0.2, 0.3]*nListReps, 'p2':[0.2, 0.1,0.3]*nListReps, 'p3':[0.1,0.9, 0.01]*nListReps, 'p4':[0.11, 0.2, 0.4]*nListReps, 'p5':[0.3, 0.1,0.5]*nListReps,
        'w1':['cancel','hello', 'hi']*nListReps, 'w2':['good','bad','ugly']*nListReps, 'w3':['thanks','CUSTOM_MASK','great']*nListReps,
        'w4':['CUSTOM_MASK','CUSTOM_UNKNOWN', 'trible']*nListReps,'w5':['CUSTOM_MASK','CUSTOM_MASK','job']*nListReps})

from timeit import timeit

def foo_1(df):
    df['top_p'] = df.filter(like='p').mask(df.filter(like='w').isin(['CUSTOM_MASK','CUSTOM_UNKNOWN']).to_numpy()).sum(axis=1)
    return df

def foo_2(df):
    pCols, wCols = ['p' + str(i + 1) for i in range(5)], ['w' + str(i + 1) for i in range(5)]
    df['top_p'] = df.apply(lambda x: np.dot(x[pCols], ~(x[wCols].isin(['CUSTOM_MASK','CUSTOM_UNKNOWN']))), axis=1)
    return df

def foo_3(df):
    pCols, wCols = [[c for c in df.columns if c[0] == char] for char in 'pw']
    colMap = {wCols[i] : pCols[i] for i in range(len(pCols))}
    df['top_p'] = (df[pCols] * ~df[wCols].rename(columns=colMap).isin(['CUSTOM_MASK','CUSTOM_UNKNOWN'])).sum(axis=1)
    return df

def foo_4(df):
    pCols, wCols = [[c for c in df.columns if c[0] == char] for char in 'pw']
    df['top_p'] = (df[pCols].to_numpy() * ~df[wCols].isin(['CUSTOM_MASK','CUSTOM_UNKNOWN']).to_numpy()).sum(axis=1)
    return df

n = 3
print(f'Timeit results for df with {len(df.index)} rows:')
for foo in ['foo_'+str(i + 1) for i in range(4)]:
    t = timeit(f"{foo}(df.copy())", setup=f"from __main__ import df, {foo}", number=n) / n
    print(f'{foo} ran in {t} seconds using {n} iterations')

Conclusion: The absolute fastest of these four approaches seems to be Numpy element-wise multiply and sum. However, @enke's Pandas mask and sum is pretty close in performance and is arguably the most aesthetically pleasing of the four candidates.

Perhaps this hybrid of the two (which runs about as fast as #4 above) is worth considering:

df['top_p'] = (df.filter(like='p').to_numpy() * ~df.filter(like='w').isin(['CUSTOM_MASK','CUSTOM_UNKNOWN']).to_numpy()).sum(axis=1)

Upvotes: 1

user7864386

Reputation:

You can use mask. The idea is to create a boolean mask from the w columns, use it to mask out the corresponding p columns, and then sum:

df['top_p'] = df.filter(like='p').mask(df.filter(like='w').isin(['CUSTOM_MASK','CUSTOM_UNKNOWN']).to_numpy()).sum(axis=1)

Output:

    p1   p2    p3    p4   p5      w1    w2           w3              w4           w5  top_p
0  0.1  0.2  0.10  0.11  0.3  cancel  good       thanks     CUSTOM_MASK  CUSTOM_MASK   0.40
1  0.2  0.1  0.90  0.20  0.1   hello   bad  CUSTOM_MASK  CUSTOM_UNKNOWN  CUSTOM_MASK   0.30
2  0.3  0.3  0.01  0.40  0.5      hi  ugly        great          trible          job   1.51

Before summing, the output of mask looks like:

    p1   p2    p3   p4   p5
0  0.1  0.2  0.10  NaN  NaN
1  0.2  0.1   NaN  NaN  NaN
2  0.3  0.3  0.01  0.4  0.5

Upvotes: 4
