Reputation: 2150
I have a data frame like this:
import pandas as pd

mydf = {'p1': [0.1, 0.2, 0.3], 'p2': [0.2, 0.1, 0.3], 'p3': [0.1, 0.9, 0.01], 'p4': [0.11, 0.2, 0.4], 'p5': [0.3, 0.1, 0.5],
        'w1': ['cancel', 'hello', 'hi'], 'w2': ['good', 'bad', 'ugly'], 'w3': ['thanks', 'CUSTOM_MASK', 'great'],
        'w4': ['CUSTOM_MASK', 'CUSTOM_UNKNOWN', 'trible'], 'w5': ['CUSTOM_MASK', 'CUSTOM_MASK', 'job']}
df = pd.DataFrame(mydf)
What I need to do is sum the values in columns p1, p2, p3, p4, p5 wherever the corresponding values in w1, w2, w3, w4, w5 are not CUSTOM_MASK or CUSTOM_UNKNOWN.
So the result would be to add a column to the data frame like this (0.1 + 0.2 + 0.1 = 0.4 is the value for the first row):
top_p
0.4
0.3
1.51
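Row by row, that works out to: row 0 excludes w4 and w5, so 0.1 + 0.2 + 0.1 = 0.4; row 1 excludes w3, w4 and w5, so 0.2 + 0.1 = 0.3; row 2 excludes nothing, so 0.3 + 0.3 + 0.01 + 0.4 + 0.5 = 1.51.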
So my question is: is there a pandas way to do this? What I have done so far is to loop through the rows and then the columns, check the values (CUSTOM_MASK, CUSTOM_UNKNOWN), and sum only where those values are not present.
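Roughly, that loop looks like this (a minimal sketch of what I described, assuming the columns always come in matching pN/wN pairs):
excluded = ['CUSTOM_MASK', 'CUSTOM_UNKNOWN']
totals = []
for _, row in df.iterrows():
    total = 0.0
    for i in range(1, 6):
        # only add pN when the matching wN value is not one of the excluded tokens
        if row['w' + str(i)] not in excluded:
            total += row['p' + str(i)]
    totals.append(total)
df['top_p'] = totals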
Upvotes: 0
Views: 2201
Reputation: 9379
Here's a way to do this using np.dot():
import numpy as np

pCols, wCols = ['p' + str(i + 1) for i in range(5)], ['w' + str(i + 1) for i in range(5)]
df['top_p'] = df.apply(lambda x: np.dot(x[pCols], ~(x[wCols].isin(['CUSTOM_MASK','CUSTOM_UNKNOWN']))), axis=1)
We first prepare the two sets of column names, p1,...,p5 and w1,...,w5. Then we use apply() to take the dot product of the values in the pN columns with the filtering criteria based on the wN columns (namely, include only contributions from pN column values whose corresponding wN column value is not in the list of excluded strings).
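To see what the dot product is doing, here is a hand-worked sketch for the first row (the boolean vector is just the negated isin() result):
p_row = np.array([0.1, 0.2, 0.1, 0.11, 0.3])        # p1..p5 for row 0
keep = np.array([True, True, True, False, False])   # w4 and w5 are CUSTOM_MASK
np.dot(p_row, keep)                                  # 0.1 + 0.2 + 0.1 -> 0.4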
Output:
p1 p2 p3 p4 p5 w1 w2 w3 w4 w5 top_p
0 0.1 0.2 0.10 0.11 0.3 cancel good thanks CUSTOM_MASK CUSTOM_MASK 0.40
1 0.2 0.1 0.90 0.20 0.1 hello bad CUSTOM_MASK CUSTOM_UNKNOWN CUSTOM_MASK 0.30
2 0.3 0.3 0.01 0.40 0.5 hi ugly great trible job 1.51
Alternatively, element-wise multiplication and sum across columns can be used like this:
pCols, wCols = [[c for c in df.columns if c[0] == char] for char in 'pw']
colMap = {wCols[i]: pCols[i] for i in range(len(pCols))}
df['top_p'] = (df[pCols] * ~df[wCols].rename(columns=colMap).isin(['CUSTOM_MASK','CUSTOM_UNKNOWN'])).sum(axis=1)
Here, we need to rename the columns of one of the 5-column DataFrames so that * (DataFrame.multiply()) aligns the columns and can do the element-wise multiplication.
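To make the alignment concrete, the renamed boolean frame computed from the example data looks like this (True means the value is kept):
~df[wCols].rename(columns=colMap).isin(['CUSTOM_MASK', 'CUSTOM_UNKNOWN'])
#       p1     p2     p3     p4     p5
# 0   True   True   True  False  False
# 1   True   True  False  False  False
# 2   True   True   True   True   True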
UPDATE: Here are a few timing comparisons on various possible methods for solving this question:
#1. Pandas mask and sum (see answer by @enke):
df['top_p'] = df.filter(like='p').mask(df.filter(like='w').isin(['CUSTOM_MASK','CUSTOM_UNKNOWN']).to_numpy()).sum(axis=1)
#2. Pandas apply with Numpy dot solution:
pCols, wCols = ['p' + str(i + 1) for i in range(5)], ['w' + str(i + 1) for i in range(5)]
df['top_p'] = df.apply(lambda x: np.dot(x[pCols], ~(x[wCols].isin(['CUSTOM_MASK','CUSTOM_UNKNOWN']))), axis=1)
#3. Pandas element-wise multiply and sum:
pCols, wCols = [[c for c in df.columns if c[0] == char] for char in 'pw']
colMap = {wCols[i] : pCols[i] for i in range(len(pCols))}
df['top_p'] = (df[pCols] * ~df[wCols].rename(columns=colMap).isin(['CUSTOM_MASK','CUSTOM_UNKNOWN'])).sum(axis=1)
#4. Numpy element-wise multiply and sum:
pCols, wCols = [[c for c in df.columns if c[0] == char] for char in 'pw']
df['top_p'] = (df[pCols].to_numpy() * ~df[wCols].isin(['CUSTOM_MASK','CUSTOM_UNKNOWN']).to_numpy()).sum(axis=1)
Timing results:
Timeit results for df with 30000 rows:
method_1 ran in 0.008165133331203833 seconds using 3 iterations
method_2 ran in 13.408894366662329 seconds using 3 iterations
method_3 ran in 0.007688766665523872 seconds using 3 iterations
method_4 ran in 0.006326200003968552 seconds using 3 iterations
Time performance results: Method #4 (numpy multiply/sum) is about 20% faster than the runners-up. Methods #1 and #3 (pandas mask/sum vs multiply/sum) are neck-and-neck in second place. Method #2 (pandas apply/numpy dot) is frightfully slow.
Here's the timeit() test code in case it's of interest:
import pandas as pd
import numpy as np
nListReps = 10000
df = pd.DataFrame({'p1':[0.1, 0.2, 0.3]*nListReps, 'p2':[0.2, 0.1,0.3]*nListReps, 'p3':[0.1,0.9, 0.01]*nListReps, 'p4':[0.11, 0.2, 0.4]*nListReps, 'p5':[0.3, 0.1,0.5]*nListReps,
'w1':['cancel','hello', 'hi']*nListReps, 'w2':['good','bad','ugly']*nListReps, 'w3':['thanks','CUSTOM_MASK','great']*nListReps,
'w4':['CUSTOM_MASK','CUSTOM_UNKNOWN', 'trible']*nListReps,'w5':['CUSTOM_MASK','CUSTOM_MASK','job']*nListReps})
from timeit import timeit
def foo_1(df):
    df['top_p'] = df.filter(like='p').mask(df.filter(like='w').isin(['CUSTOM_MASK','CUSTOM_UNKNOWN']).to_numpy()).sum(axis=1)
    return df

def foo_2(df):
    pCols, wCols = ['p' + str(i + 1) for i in range(5)], ['w' + str(i + 1) for i in range(5)]
    df['top_p'] = df.apply(lambda x: np.dot(x[pCols], ~(x[wCols].isin(['CUSTOM_MASK','CUSTOM_UNKNOWN']))), axis=1)
    return df

def foo_3(df):
    pCols, wCols = [[c for c in df.columns if c[0] == char] for char in 'pw']
    colMap = {wCols[i]: pCols[i] for i in range(len(pCols))}
    df['top_p'] = (df[pCols] * ~df[wCols].rename(columns=colMap).isin(['CUSTOM_MASK','CUSTOM_UNKNOWN'])).sum(axis=1)
    return df

def foo_4(df):
    pCols, wCols = [[c for c in df.columns if c[0] == char] for char in 'pw']
    df['top_p'] = (df[pCols].to_numpy() * ~df[wCols].isin(['CUSTOM_MASK','CUSTOM_UNKNOWN']).to_numpy()).sum(axis=1)
    return df

n = 3
print(f'Timeit results for df with {len(df.index)} rows:')
for foo in ['foo_' + str(i + 1) for i in range(4)]:
    t = timeit(f"{foo}(df.copy())", setup=f"from __main__ import df, {foo}", number=n) / n
    print(f'{foo} ran in {t} seconds using {n} iterations')
Conclusion: The absolute fastest of these four approaches seems to be Numpy element-wise multiply and sum. However, @enke's Pandas mask and sum is pretty close in performance and is arguably the most aesthetically pleasing of the four candidates.
Perhaps this hybrid of the two (which runs about as fast as #4 above) is worth considering:
df['top_p'] = (df.filter(like='p').to_numpy() * ~df.filter(like='w').isin(['CUSTOM_MASK','CUSTOM_UNKNOWN']).to_numpy()).sum(axis=1)
Upvotes: 1
Reputation:
You can use mask. The idea is to build a boolean mask from the w columns and use it to mask out the corresponding p values before summing:
df['top_p'] = df.filter(like='p').mask(df.filter(like='w').isin(['CUSTOM_MASK','CUSTOM_UNKNOWN']).to_numpy()).sum(axis=1)
Output:
p1 p2 p3 p4 p5 w1 w2 w3 w4 w5 top_p
0 0.1 0.2 0.10 0.11 0.3 cancel good thanks CUSTOM_MASK CUSTOM_MASK 0.40
1 0.2 0.1 0.90 0.20 0.1 hello bad CUSTOM_MASK CUSTOM_UNKNOWN CUSTOM_MASK 0.30
2 0.3 0.3 0.01 0.40 0.5 hi ugly great trible job 1.51
Before summing, the output of mask looks like:
p1 p2 p3 p4 p5
0 0.1 0.2 0.10 NaN NaN
1 0.2 0.1 NaN NaN NaN
2 0.3 0.3 0.01 0.4 0.5
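For reference, the isin() mask that drives this (True marks a value to be dropped) is:
df.filter(like='w').isin(['CUSTOM_MASK', 'CUSTOM_UNKNOWN'])
#       w1     w2     w3     w4     w5
# 0  False  False  False   True   True
# 1  False  False   True   True   True
# 2  False  False  False  False  False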
Upvotes: 4