Reputation: 1973
I have a pandas dataframe.
import pandas as pd
data = pd.DataFrame({
'a': [0,1,0,0,1,1,0,1],
'b': [0,0,1,0,1,0,1,1],
'c': [0,0,0,1,0,1,1,1],
'rate': [0,0.1,0.11,0.12,0.24,0.27,0.3,0.4]})
a,b,c are my channels and I am adding another column that shows the sum of row wise totals of these channels by writing:
data['total'] = data.a + data.b + data.c
data
a b c rate total
1 1 0 0 0.10 1
2 0 1 0 0.11 1
3 0 0 1 0.12 1
4 1 1 0 0.24 2
5 1 0 1 0.27 2
6 0 1 1 0.30 2
7 1 1 1 0.40 3
I would like to work on data where total = 1 and total = 2
reduced = data[(data.a == 1) & (data.total == 2)]
print(reduced)
a b c rate total
4 1 1 0 0.24 2
5 1 0 1 0.27 2
I would like to add columns to this reduced dataframe that looks like this:
a b c rate total prob_a prob_b prob_c
4 1 1 0 0.24 2 0.1 0.11 0
5 1 0 1 0.27 2 0.1 0 0.12
In the first row of the reduced data-frame, prob_c is 0 since C is absent ( ABC => 110). In the second row of the reduced data-frame, prob_b is 0 since B is absent (ABC => 101)
where,
# Channel a alone occurs (ABC => 100)
prob_a = data['rate'][(data.a == 1) & (data.total == 1)]
# Channel b alone occurs (ABC => 010)
prob_b = data['rate'][(data.b == 1) & (data.total == 1)]
# Channel c alone occurs (ABC => 001)
prob_c = data['rate'][(data.c == 1) & (data.total == 1)]
I tried this:
reduced['prob_a'] = data['rate'][(data.a == 1) & (data.total == 1)]
reduced['prob_b'] = data['rate'][(data.b == 1) & (data.total == 1)]
reduced['prob_c'] = data['rate'][(data.c == 1) & (data.total == 1)]
print(reduced)
which results in this output:
a b c rate total prob_a prob_b prob_c
4 1 1 0 0.24 2 NaN NaN NaN
5 1 0 1 0.27 2 NaN NaN NaN
Upvotes: 2
Views: 322
Reputation: 294218
Break it into two steps
One
Calculate Probabilities
probs = data.query('a + b + c == 1').pipe(
lambda d: d.drop('rate', 1).T.dot(d.rate)
)
probs
a 0.10
b 0.11
c 0.12
dtype: float64
How it works
Get just the total equals to one rows:
data.query('a + b + c == 1')
a b c rate
1 1 0 0 0.10
2 0 1 0 0.11
3 0 0 1 0.12
pipe
allows us to pipeline the results into function. From the result above, I want to dot product the columns a
, b
, c
with the column rate
. The lambda
via the pipe
allows me to do that.
More Generalized Approach
df = data.copy()
rate = df.pop('rate')
mask = df.sum(1) == 1
probs = df[mask].T.dot(rate[mask])
probs
Two
The Rest
data.join(
data.drop('rate', 1).mul(probs).add_prefix('prob_')
).query('a + b + c == 2 and a == 1')
a b c rate prob_a prob_b prob_c
4 1 1 0 0.24 0.1 0.11 0.00
5 1 0 1 0.27 0.1 0.00 0.12
Upvotes: 2
Reputation: 323226
IIUC
reduced['prob_a'] = data.loc[(data.a == 1) & (data.total == 1),'rate'].values[0]
reduced['prob_b'] = data.loc[(data.b == 1) & (data.total == 1),'rate'].values[0]
reduced['prob_c'] = data.loc[(data.c == 1) & (data.total == 1),'rate'].values[0]
reduced[['prob_a','prob_b','prob_c']]=reduced[['prob_a','prob_b','prob_c']].mul(reduced[['a','b','c']].eq(1).values)
reduced
Out[698]:
a b c rate total prob_a prob_b prob_c
4 1 1 0 0.24 2 0.1 0.11 0.00
5 1 0 1 0.27 2 0.1 0.00 0.12
Upvotes: 3