Reputation: 109
I have a dataframe as shown below:
import numpy as np
import pandas as pd

col1 = ['a','b','c','a','c','a','b','c','a']
col2 = [1,1,0,1,1,0,1,1,0]
df2 = pd.DataFrame(zip(col1,col2),columns=['name','count'])
  name  count
0    a      1
1    b      1
2    c      0
3    a      1
4    c      1
5    a      0
6    b      1
7    c      1
8    a      0
I am trying to find, for each element in the "name" column, the ratio of the number of zeros to the total number of zeros and ones. First, I aggregated the counts as follows:
for j in df2.name.unique():
    print(j)
    zero_ct = zero_one_frequencies[zero_one_frequencies['name'] == j][0]
    full_ct = zero_one_frequencies[zero_one_frequencies['name'] == j][0] + zero_one_frequencies[zero_one_frequencies['name'] == j][1]
    zero_pb = zero_ct / full_ct
    one_pb = 1 - zero_pb
    print(f"ZERO ratios for {j} = {zero_pb}")
    print(f"One ratios for {j} = {one_pb}")
    print("="*30)
And the output looks like:
a
ZERO ratios for a = 0 0.5
dtype: float64
One ratios for a = 0 0.5
dtype: float64
==============================
b
ZERO ratios for b = 1 0.0
dtype: float64
One ratios for b = 1 1.0
dtype: float64
==============================
c
ZERO ratios for c = 2 0.333333
dtype: float64
One ratios for c = 2 0.666667
dtype: float64
==============================
My goal is to add two new columns to the dataframe, "name_0" and "name_1", holding the ratio values for each element in the "name" column. I tried the following, but it's not giving the expected results:
for j in df2.name.unique():
    print(j)
    zero_ct = zero_one_frequencies[zero_one_frequencies['name'] == j][0]
    full_ct = zero_one_frequencies[zero_one_frequencies['name'] == j][0] + zero_one_frequencies[zero_one_frequencies['name'] == j][1]
    zero_pb = zero_ct / full_ct
    one_pb = 1 - zero_pb
    print(f"ZERO Probability for {j} = {zero_pb}")
    print(f"One Probability for {j} = {one_pb}")
    print("="*30)
    condition1 = [ df2['name'].eq(j) & df2['count'].eq(0)]
    condition2 = [ df2['name'].eq(j) & df2['count'].eq(1)]
    choice1 = zero_pb.tolist()
    choice2 = one_pb.tolist()
    print(f'choice1 = {choice1}, choice2 = {choice2}')
    df2["name"+str("_0")] = np.select(condition1, choice1, default=0)
    df2["name"+str("_1")] = np.select(condition2, choice2, default=0)
Both columns end up holding only the values for the name element 'c'. That is to be expected: every pass through the loop overwrites the whole column, so only the values computed last survive.
Is there a way to use np.select effectively here?
Expected output:
  name  count    name_0    name_1
0    a      1  0.000000  0.500000
1    b      1  0.000000  1.000000
2    c      0  0.333333  0.000000
3    a      1  0.000000  0.500000
4    c      1  0.000000  0.666667
5    a      0  0.500000  0.000000
6    b      1  0.000000  1.000000
7    c      1  0.000000  0.666667
8    a      0  0.500000  0.000000
Upvotes: 0
Views: 66
Reputation: 888
I did not have access to the zero_one_frequencies dataframe, so I took the liberty of solving the problem my own way.
import pandas as pd
import numpy as np

col1 = ['a','b','c','a','c','a','b','c','a']
col2 = [1,1,0,1,1,0,1,1,0]
df2 = pd.DataFrame(zip(col1,col2),columns=['name','count'])
# Start with float columns so the ratios can be stored without a dtype change
df2["name_0"] = 0.0
df2["name_1"] = 0.0
for name in df2['name'].unique():
    df_name = df2[df2['name'] == name]
    # Fraction of ones among the rows for this name
    prob_1 = df_name['count'].mean()
    for count in df2['count'].unique():
        # Rows that have this name and this count value
        mask = (df2['name'] == name) & (df2['count'] == count)
        # count == 0 yields 1 - prob_1, count == 1 yields prob_1
        df2.loc[mask, "name_" + str(count)] = np.abs(((count + 1) % 2) - prob_1)
Output:
  name  count    name_0    name_1
0    a      1  0.000000  0.500000
1    b      1  0.000000  1.000000
2    c      0  0.333333  0.000000
3    a      1  0.000000  0.500000
4    c      1  0.000000  0.666667
5    a      0  0.500000  0.000000
6    b      1  0.000000  1.000000
7    c      1  0.000000  0.666667
8    a      0  0.500000  0.000000
For understanding np.select I recommend seeing this post.
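As for np.select itself, here is a minimal sketch of how it could be used without a loop. It assumes the ratios can be recomputed directly from df2 instead of from zero_one_frequencies, and it relies on the counts being only 0 and 1, so the per-name mean of "count" is the share of ones. The names one_ratio and zero_ratio are my own, not from the question.

```python
import numpy as np
import pandas as pd

col1 = ['a', 'b', 'c', 'a', 'c', 'a', 'b', 'c', 'a']
col2 = [1, 1, 0, 1, 1, 0, 1, 1, 0]
df2 = pd.DataFrame(zip(col1, col2), columns=['name', 'count'])

# Share of ones per name, broadcast back onto every row of that name
# (assumption: 'count' only ever contains 0 or 1).
one_ratio = df2.groupby('name')['count'].transform('mean')
zero_ratio = 1 - one_ratio

# A single np.select call per output column: the condition picks the rows,
# the choice supplies a row-aligned value, and every other row gets the default.
df2['name_0'] = np.select([df2['count'].eq(0)], [zero_ratio], default=0.0)
df2['name_1'] = np.select([df2['count'].eq(1)], [one_ratio], default=0.0)

print(df2)
```

Because the choices are already aligned row by row, np.select is called once for the whole column, so nothing is overwritten between names.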
Upvotes: 1
Reputation: 109
The following code fixed the issue, although I couldn't find a way to get the same result using numpy.select.
df2["name"+str("_0")] = 0.0
df2["name"+str("_1")] = 0.0
for j in df2.name.unique():
    print(j)
    zero_ct = zero_one_frequencies[zero_one_frequencies['name'] == j][0]
    full_ct = zero_one_frequencies[zero_one_frequencies['name'] == j][0] + zero_one_frequencies[zero_one_frequencies['name'] == j][1]
    zero_pb = zero_ct / full_ct
    one_pb = 1 - zero_pb
    print(f"ZERO Probability for {j} = {zero_pb.tolist()[0]}")
    print(f"One Probability for {j} = {one_pb.tolist()[0]}")
    print("="*30)
    # Fill in the ratio row by row for the rows that belong to this name
    for idx in df2[df2['name']== j ].index:
        print("Index:::", idx)
        count = df2.at[idx, 'count']
        if count == 0:
            df2.at[idx, "name"+str("_0")] = zero_pb.tolist()[0]
            print(f'Count for {j} at index {idx} is {count}')
            print('printing name_0: ', df2.at[idx, "name"+str("_0")])
            print("*"*30)
        elif count == 1:
            df2.at[idx, "name"+str("_1")] = one_pb.tolist()[0]
            print(f'Count for {j} at index {idx} is {count}')
            print('printing name_1: ', df2.at[idx, "name"+str("_1")])
            print("*"*30)
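The row-by-row loop can probably be avoided altogether. The sketch below is only a guess at the setup: zero_one_frequencies is never shown, so a stand-in table called freq (a hypothetical name) is built with pd.crosstab, normalised per name so that its columns 0 and 1 already hold the ratios.

```python
import pandas as pd

col1 = ['a', 'b', 'c', 'a', 'c', 'a', 'b', 'c', 'a']
col2 = [1, 1, 0, 1, 1, 0, 1, 1, 0]
df2 = pd.DataFrame(zip(col1, col2), columns=['name', 'count'])

# Hypothetical stand-in for zero_one_frequencies: one row per name,
# columns 0 and 1 holding the share of zeros and ones for that name.
freq = pd.crosstab(df2['name'], df2['count'], normalize='index')

# Look up each row's ratio by name, then keep it only on rows whose
# 'count' matches the column being filled; other rows become 0.0.
df2['name_0'] = df2['name'].map(freq[0]).where(df2['count'].eq(0), 0.0)
df2['name_1'] = df2['name'].map(freq[1]).where(df2['count'].eq(1), 0.0)

print(df2)
```

The lookup via map replaces the inner loop over indices, and where plays the role of the if/elif on the count value.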
Upvotes: 0