Reputation: 967
I'm trying to add a new column to a dataframe, based on a groupby of the original dataframe, and assigning one of three values based on the results of that groupby
i.e. In the df below I want to add a fourth column to show whether I have too many 'A's, too many 'C's, or just the right amount.
col1 col2 col3
0 1111 A 1
1 1111 B 3
2 1111 B 3
3 1111 B 3
4 1111 C 1
5 2222 A 1
6 2222 B 1
7 2222 C 2
8 2222 C 2
9 3333 A 2
10 3333 A 2
11 3333 B 1
12 3333 C 1
becomes...
col1 col2 col3 col4
0 1111 A 1 OK
1 1111 B 3 OK
2 1111 B 3 OK
3 1111 B 3 OK
4 1111 C 1 OK
5 2222 A 1 >C
6 2222 B 1 >C
7 2222 C 2 >C
8 2222 C 2 >C
9 3333 A 2 >A
10 3333 A 2 >A
11 3333 B 1 >A
12 3333 C 1 >A
I was thinking of using transform, but that returns series, and I think in this case I need to check two different columns?
Example:
d1 = {'col1': ['1111', '1111', '1111', '1111', '1111', '2222', '2222', '2222', '2222', '3333', '3333', '3333', '3333'],
'col2': ['A', 'B', 'B', 'B', 'C', 'A', 'B', 'C', 'C', 'A', 'A', 'B', 'C'],
'col3': [1, 3, 3, 3, 1, 1, 1, 2, 2, 2, 2, 1, 1]}
df1 = pd.DataFrame(data=d1)
d2 = {'col1': ['1111', '1111', '1111', '1111', '1111', '2222', '2222', '2222', '2222', '3333', '3333', '3333', '3333'],
'col2': ['A', 'B', 'B', 'B', 'C', 'A', 'B', 'C', 'C', 'A', 'A', 'B', 'C'],
'col3': [1, 3, 3, 3, 1, 1, 1, 2, 2, 2, 2, 1, 1],
'col4': ['OK', 'OK', 'OK', 'OK', 'OK', '>C', '>C', '>C', '>C', '>A', '>A', '>A', '>A']}
df2 = pd.DataFrame(data=d2)
print(df1)
print(df2)
Upvotes: 0
Views: 69
Reputation: 75080
From what I understand, try this :
Get counts using crosstab
of col2
wrt col1
values then using .loc[]
filter the values you want to consider , then using df.gt
compare if count is greater than 1 and do a dot
multiplication with column names , finally map
it back with series.fillna
Solution:
values_to_filter = ['A','C'] #put B for testing and it will show >B for first group
m = pd.crosstab(df1['col1'],df1['col2']).loc[:,values_to_filter]
df1['col4'] = (df1['col1'].map(m.gt(1).dot(m.columns).replace('',np.nan)
.radd('>')).fillna('OK'))
Output:
print(df1)
col1 col2 col3 col4
0 1111 A 1 OK
1 1111 B 3 OK
2 1111 B 3 OK
3 1111 B 3 OK
4 1111 C 1 OK
5 2222 A 1 >C
6 2222 B 1 >C
7 2222 C 2 >C
8 2222 C 2 >C
9 3333 A 2 >A
10 3333 A 2 >A
11 3333 B 1 >A
12 3333 C 1 >A
Additional Details: Where m
is a count of col2
values w.r.t col1
values:
print(m)
col2 A C
col1
1111 1 1
2222 1 2
3333 2 1
Upvotes: 1