Reputation: 101
I'm having difficulty creating a feature based on a groupby plus some conditions.
The data I have looks similar to:
| | ir_id | pli | pli_missing | err_type |
---|---|---|---|---|
0 | name1 | 1.0 | no | UNKNOWN |
1 | name1 | 2.0 | no | NaN |
2 | name1 | 3.0 | no | NaN |
3 | name1 | NaN | yes | UNKNOWN |
4 | name2 | 4.0 | no | NaN |
5 | name2 | 5.0 | no | NaN |
6 | name2 | NaN | yes | UNKNOWN |
7 | name3 | 6.0 | no | NaN |
8 | name3 | 7.0 | no | NaN |
9 | name3 | 8.0 | no | NaN |
10 | name3 | 9.0 | no | UNKNOWN |
11 | name4 | 10.0 | no | NaN |
12 | name4 | 11.0 | no | NaN |
13 | name4 | 12.0 | no | NaN |
14 | name5 | NaN | yes | UNKNOWN |
15 | name5 | NaN | yes | UNKNOWN |
16 | name5 | NaN | yes | UNKNOWN |
17 | name5 | NaN | yes | UNKNOWN |
I want to group by ir_id so that I can create an err_flag column as follows:

type1: the group has "UNKNOWN" in the err_type column, and also "yes" in pli_missing:
| | ir_id | pli | pli_missing | err_type | err_flag |
---|---|---|---|---|---|
4 | name2 | 4.0 | no | NaN | type1 |
5 | name2 | 5.0 | no | NaN | type1 |
6 | name2 | NaN | yes | UNKNOWN | type1 |
| | ir_id | pli | pli_missing | err_type | err_flag |
---|---|---|---|---|---|
14 | name5 | NaN | yes | UNKNOWN | type1 |
15 | name5 | NaN | yes | UNKNOWN | type1 |
16 | name5 | NaN | yes | UNKNOWN | type1 |
17 | name5 | NaN | yes | UNKNOWN | type1 |
type2: the group has "UNKNOWN" in the err_type column, and also "no" in pli_missing:
| | ir_id | pli | pli_missing | err_type | err_flag |
---|---|---|---|---|---|
7 | name3 | 6.0 | no | NaN | type2 |
8 | name3 | 7.0 | no | NaN | type2 |
9 | name3 | 8.0 | no | NaN | type2 |
10 | name3 | 9.0 | no | UNKNOWN | type2 |
type3: the group has no "UNKNOWN" in the err_type column, and "no" throughout pli_missing:
| | ir_id | pli | pli_missing | err_type | err_flag |
---|---|---|---|---|---|
11 | name4 | 10.0 | no | NaN | type3 |
12 | name4 | 11.0 | no | NaN | type3 |
13 | name4 | 12.0 | no | NaN | type3 |
both_type: the group has "UNKNOWN" rows with both "yes" and "no" in pli_missing:

| | ir_id | pli | pli_missing | err_type | err_flag |
---|---|---|---|---|---|
0 | name1 | 1.0 | no | UNKNOWN | both_type |
1 | name1 | 2.0 | no | NaN | both_type |
2 | name1 | 3.0 | no | NaN | both_type |
3 | name1 | NaN | yes | UNKNOWN | both_type |
Which results in the final output:
| | ir_id | pli | pli_missing | err_type | err_flag |
---|---|---|---|---|---|
0 | name1 | 1.0 | no | UNKNOWN | both_type |
1 | name1 | 2.0 | no | NaN | both_type |
2 | name1 | 3.0 | no | NaN | both_type |
3 | name1 | NaN | yes | UNKNOWN | both_type |
4 | name2 | 4.0 | no | NaN | type1 |
5 | name2 | 5.0 | no | NaN | type1 |
6 | name2 | NaN | yes | UNKNOWN | type1 |
7 | name3 | 6.0 | no | NaN | type2 |
8 | name3 | 7.0 | no | NaN | type2 |
9 | name3 | 8.0 | no | NaN | type2 |
10 | name3 | 9.0 | no | UNKNOWN | type2 |
11 | name4 | 10.0 | no | NaN | type3 |
12 | name4 | 11.0 | no | NaN | type3 |
13 | name4 | 12.0 | no | NaN | type3 |
14 | name5 | NaN | yes | UNKNOWN | type1 |
15 | name5 | NaN | yes | UNKNOWN | type1 |
16 | name5 | NaN | yes | UNKNOWN | type1 |
17 | name5 | NaN | yes | UNKNOWN | type1 |
dataset used:
import numpy as np
import pandas as pd

custom_df = pd.DataFrame.from_dict({
    'ir_id': ['name1', 'name1', 'name1', 'name1', 'name2', 'name2', 'name2',
              'name3', 'name3', 'name3', 'name3', 'name4', 'name4', 'name4',
              'name5', 'name5', 'name5', 'name5'],
    'pli': [1, 2, 3, np.nan, 4, 5, np.nan, 6, 7, 8, 9, 10, 11, 12,
            np.nan, np.nan, np.nan, np.nan],
    'pli_missing': ["no", "no", "no", "yes", "no", "no", "yes", "no", "no",
                    "no", "no", "no", "no", "no", "yes", "yes", "yes", "yes"],
    'err_type': ["UNKNOWN", np.nan, np.nan, "UNKNOWN", np.nan, np.nan,
                 "UNKNOWN", np.nan, np.nan, np.nan, "UNKNOWN", np.nan,
                 np.nan, np.nan, "UNKNOWN", "UNKNOWN", "UNKNOWN", "UNKNOWN"],
    'err_flag': ["both_type", "both_type", "both_type", "both_type",
                 "type1", "type1", "type1", "type2", "type2", "type2",
                 "type2", "type3", "type3", "type3", "type1", "type1",
                 "type1", "type1"],
})
custom_df
P.S. An earlier solution couldn't handle the case ir_id = name5.
Upvotes: 1
Views: 80
Reputation: 7230
I think we could apply a sort of categorical logic here. Here's what I mean. Let's say ['type3','type1','type2','both_types'] are verbal representations of the codes [0, 1, 2, 3]. Why this order? As far as I can see, type3 is sort of a default value. type1 and type2 are equal by nature, but we have to distinguish them somehow. So let's follow their names and say that type1 is 1 and type2 is 2. And both_types has index 3 as the sum of the previous two codes. Now we can separately identify whether a record can be of type 1 or of type 2, and take their sum as the final output. If a record is neither a type 1 nor a type 2 kind of error, the sum will be 0. If there's only one of them, the sum will keep it as is. If a record can be of both types, the sum will be 3.
Let's see how it looks in code:
err = custom_df['err_type'] == 'UNKNOWN'
pli = custom_df['pli_missing'] == 'yes'
grouper = custom_df['ir_id']
# with transform get the same dimension as custom_df
type1 = (err & pli).groupby(grouper).transform(any)
type2 = (err & ~pli).groupby(grouper).transform(any)
codes = type1 + 2*type2
categories = ['type3','type1','type2','both_types']
custom_df['err_flag_new'] = pd.Categorical.from_codes(codes, categories)
Here's what I've got in the end:
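The resulting frame isn't reproduced here, so below is a self-contained run of the same approach on the question's data (restricted to the columns the logic actually uses), showing the flag assigned to each group. Note that this answer spells the combined label both_types, while the question uses both_type:

```python
import numpy as np
import pandas as pd

custom_df = pd.DataFrame({
    'ir_id': ['name1'] * 4 + ['name2'] * 3 + ['name3'] * 4
             + ['name4'] * 3 + ['name5'] * 4,
    'pli_missing': ['no', 'no', 'no', 'yes', 'no', 'no', 'yes', 'no', 'no',
                    'no', 'no', 'no', 'no', 'no', 'yes', 'yes', 'yes', 'yes'],
    'err_type': ['UNKNOWN', np.nan, np.nan, 'UNKNOWN', np.nan, np.nan,
                 'UNKNOWN', np.nan, np.nan, np.nan, 'UNKNOWN', np.nan,
                 np.nan, np.nan, 'UNKNOWN', 'UNKNOWN', 'UNKNOWN', 'UNKNOWN'],
})

err = custom_df['err_type'] == 'UNKNOWN'
pli = custom_df['pli_missing'] == 'yes'
grouper = custom_df['ir_id']
# transform('any') broadcasts each group's True/False back to every row
type1 = (err & pli).groupby(grouper).transform('any')
type2 = (err & ~pli).groupby(grouper).transform('any')
codes = type1 + 2 * type2
categories = ['type3', 'type1', 'type2', 'both_types']
custom_df['err_flag_new'] = pd.Categorical.from_codes(codes, categories)

# one row per group: name1 -> both_types, name2 -> type1, name3 -> type2,
# name4 -> type3, name5 -> type1
print(custom_df.drop_duplicates('ir_id')[['ir_id', 'err_flag_new']])
```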
Update
We can look at it this way. Suppose there's a system with some number of independent states. The system can be described as having any combination of them. Mathematically, this can be described as a binary code. For each independent state we assign a unique place in this code, where the values 1 or 0 are interpreted as a logical answer to whether the system has the corresponding state. Binary means a sum like this one:

state[0]*2^0 + state[1]*2^1 + state[2]*2^2 + state[3]*2^3 + ...

In our case we have only 2 independent states, type1 and type2. The two others are their combinations: type3 means neither of them, and both_types speaks for itself. So we have only the first two terms of the sum above, where state[0] is the logical value for type1 and state[1] is the logical value for type2. That's why I used codes = type1 + 2*type2, which is equal to codes = type1 * 2**0 + type2 * 2**1.
As for the order in ['type3','type1','type2','both_types']: in this list the index of each value matches its code, i.e. each type has the index equal to its binary code. The binary code for type3 in this model is 0b00, which is zero; for type1 it is 0b01, which is one; for type2 it is 0b10, which is 2; and for both_types it's 0b11, which is equal to 3. These codes are automatically assigned when creating a Categorical sequence from codes, i.e. pandas uses the codes as indexes to look up the corresponding values in the list and puts them in place of the codes.

See also enum.IntFlag as an abstract realization of this idea, and the flags in the regular expression module as an example of how it can be used.
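For illustration, here is the same two-bit encoding sketched with enum.IntFlag (the class and member names are made up for this example):

```python
from enum import IntFlag

class ErrFlag(IntFlag):
    TYPE3 = 0  # neither independent state set: 0b00
    TYPE1 = 1  # 2**0, i.e. 0b01
    TYPE2 = 2  # 2**1, i.e. 0b10

# combining the two independent states yields the 'both_types' code
both = ErrFlag.TYPE1 | ErrFlag.TYPE2
print(int(both))  # 3, i.e. 0b11
```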
Upvotes: 1
Reputation: 14159
I've found a solution. Someone else may well find a better one, but this seems to arrive at the requested result correctly.
import pandas as pd
import numpy as np
df = custom_df.copy()
condlist = [((df['err_type'] == 'UNKNOWN') & (df['pli_missing'] == 'yes')),
((df['err_type'] == 'UNKNOWN') & (df['pli_missing'] == 'no'))]
choicelist = ['type1','type2']
df['err_flag'] = np.select(condlist, choicelist, default='type3')
s = df[df['err_flag'] != 'type3'].groupby('ir_id')['err_flag'].nunique().gt(1)
df.loc[df['ir_id'].isin(s[s].index),'err_flag'] = 'both_type'
df['err_flag'] = df.sort_values(['ir_id','err_flag']).groupby(['ir_id'])\
['err_flag'].transform('first')
print(df[['ir_id','err_flag']].drop_duplicates())
ir_id err_flag
0 name1 both_type
4 name2 type1
7 name3 type2
11 name4 type3
14 name5 type1
print(df.equals(custom_df))
# True
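The groupby/nunique step is the heart of the both-type detection; here is a minimal, self-contained sketch of just that part, using a two-group subset made up for the illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ir_id': ['name1', 'name1', 'name2', 'name2'],
    'pli_missing': ['no', 'yes', 'no', 'yes'],
    'err_type': ['UNKNOWN', 'UNKNOWN', np.nan, 'UNKNOWN'],
})
condlist = [
    (df['err_type'] == 'UNKNOWN') & (df['pli_missing'] == 'yes'),
    (df['err_type'] == 'UNKNOWN') & (df['pli_missing'] == 'no'),
]
df['err_flag'] = np.select(condlist, ['type1', 'type2'], default='type3')

# count distinct non-type3 flags per group; >1 means both type1 and type2
s = df[df['err_flag'] != 'type3'].groupby('ir_id')['err_flag'].nunique().gt(1)
print(s)  # name1 -> True (has both flags), name2 -> False (type1 only)
```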
Explanation steps:

1. np.select: in the condlist we store the conditions for type1 and type2, we let type3 be the default (i.e. neither condition is met), and assign the result to a new column, err_flag.
2. We then keep from df only the rows that do not contain type3 in the new column, and use df.groupby and then .nunique to get the count of unique values for each group (minus a potential type3, of course).
3. That count can either be 2 (both type1 and type2) or just 1. So, we filter using .gt(1) to get a boolean Series that is True only for the groups that have both. These are the ones we want to overwrite with both_type.
4. We filter df again based on the aforementioned Series and assign both_type.
5. Finally, we sort df on ['ir_id','err_flag'] and group by ir_id. Now we can ask for the first value from each group and assign it to the entire group (hence the use of .transform), thus overwriting any type3 values that are left (i.e. both type1 and type2 will come first, because of the sort).

Upvotes: 0