Reputation: 101
I'm having difficulty creating a feature based on a groupby plus some conditions.
The data I have looks similar to:
| | ir_id | pli | pli_missing | err_type |
---|---|---|---|---|
0 | name1 | 1.0 | no | UNKNOWN |
1 | name1 | 2.0 | no | NaN |
2 | name1 | 3.0 | no | NaN |
3 | name1 | NaN | yes | UNKNOWN |
4 | name2 | 4.0 | no | NaN |
5 | name2 | 5.0 | no | NaN |
6 | name2 | NaN | yes | UNKNOWN |
7 | name3 | 6.0 | no | NaN |
8 | name3 | 7.0 | no | NaN |
9 | name3 | 8.0 | no | NaN |
10 | name3 | 9.0 | no | UNKNOWN |
11 | name4 | 10.0 | no | NaN |
12 | name4 | 11.0 | no | NaN |
13 | name4 | 12.0 | no | NaN |
14 | name5 | NaN | yes | UNKNOWN |
15 | name5 | NaN | yes | UNKNOWN |
16 | name5 | NaN | yes | UNKNOWN |
17 | name5 | NaN | yes | UNKNOWN |
I want to group by ir_id so that I can create an err_flag column as follows:

type1: the group has "UNKNOWN" in the err_type column, and also "yes" in pli_missing:
| | ir_id | pli | pli_missing | err_type | err_flag |
---|---|---|---|---|---|
4 | name2 | 4.0 | no | NaN | type1 |
5 | name2 | 5.0 | no | NaN | type1 |
6 | name2 | NaN | yes | UNKNOWN | type1 |
| | ir_id | pli | pli_missing | err_type | err_flag |
---|---|---|---|---|---|
14 | name5 | NaN | yes | UNKNOWN | type1 |
15 | name5 | NaN | yes | UNKNOWN | type1 |
16 | name5 | NaN | yes | UNKNOWN | type1 |
17 | name5 | NaN | yes | UNKNOWN | type1 |
type2: the group has "UNKNOWN" in the err_type column, and also "no" in pli_missing:
| | ir_id | pli | pli_missing | err_type | err_flag |
---|---|---|---|---|---|
7 | name3 | 6.0 | no | NaN | type2 |
8 | name3 | 7.0 | no | NaN | type2 |
9 | name3 | 8.0 | no | NaN | type2 |
10 | name3 | 9.0 | no | UNKNOWN | type2 |
type3: the group has no "UNKNOWN" in the err_type column, and "no" throughout pli_missing:
| | ir_id | pli | pli_missing | err_type | err_flag |
---|---|---|---|---|---|
11 | name4 | 10.0 | no | NaN | type3 |
12 | name4 | 11.0 | no | NaN | type3 |
13 | name4 | 12.0 | no | NaN | type3 |
both_type: the group has "UNKNOWN" rows with both "yes" and "no" in pli_missing:

| | ir_id | pli | pli_missing | err_type | err_flag |
---|---|---|---|---|---|
0 | name1 | 1.0 | no | UNKNOWN | both_type |
1 | name1 | 2.0 | no | NaN | both_type |
2 | name1 | 3.0 | no | NaN | both_type |
3 | name1 | NaN | yes | UNKNOWN | both_type |
Which results in the final output:
| | ir_id | pli | pli_missing | err_type | err_flag |
---|---|---|---|---|---|
0 | name1 | 1.0 | no | UNKNOWN | both_type |
1 | name1 | 2.0 | no | NaN | both_type |
2 | name1 | 3.0 | no | NaN | both_type |
3 | name1 | NaN | yes | UNKNOWN | both_type |
4 | name2 | 4.0 | no | NaN | type1 |
5 | name2 | 5.0 | no | NaN | type1 |
6 | name2 | NaN | yes | UNKNOWN | type1 |
7 | name3 | 6.0 | no | NaN | type2 |
8 | name3 | 7.0 | no | NaN | type2 |
9 | name3 | 8.0 | no | NaN | type2 |
10 | name3 | 9.0 | no | UNKNOWN | type2 |
11 | name4 | 10.0 | no | NaN | type3 |
12 | name4 | 11.0 | no | NaN | type3 |
13 | name4 | 12.0 | no | NaN | type3 |
14 | name5 | NaN | yes | UNKNOWN | type1 |
15 | name5 | NaN | yes | UNKNOWN | type1 |
16 | name5 | NaN | yes | UNKNOWN | type1 |
17 | name5 | NaN | yes | UNKNOWN | type1 |
dataset used:
import numpy as np
import pandas as pd

custom_df = pd.DataFrame.from_dict({
    'ir_id': ['name1', 'name1', 'name1', 'name1', 'name2', 'name2', 'name2',
              'name3', 'name3', 'name3', 'name3', 'name4', 'name4', 'name4',
              'name5', 'name5', 'name5', 'name5'],
    'pli': [1, 2, 3, np.nan, 4, 5, np.nan, 6, 7, 8, 9, 10, 11, 12,
            np.nan, np.nan, np.nan, np.nan],
    'pli_missing': ["no", "no", "no", "yes", "no", "no", "yes", "no", "no",
                    "no", "no", "no", "no", "no", "yes", "yes", "yes", "yes"],
    'err_type': ["UNKNOWN", np.nan, np.nan, "UNKNOWN", np.nan, np.nan,
                 "UNKNOWN", np.nan, np.nan, np.nan, "UNKNOWN", np.nan,
                 np.nan, np.nan, "UNKNOWN", "UNKNOWN", "UNKNOWN", "UNKNOWN"],
    'err_flag': ["both_type", "both_type", "both_type", "both_type",
                 "type1", "type1", "type1", "type2", "type2", "type2",
                 "type2", "type3", "type3", "type3", "type1", "type1",
                 "type1", "type1"],
})
custom_df
P.S. An earlier solution couldn't handle the case ir_id = name5.
Upvotes: 1
Views: 80
Reputation: 7230
I think we could apply a sort of categorical logic here. Here's what I mean. Let's say ['type3','type1','type2','both_types'] are verbal representations of the codes [0, 1, 2, 3]. Why this order? As far as I can see, type3 is sort of a default value. type1 and type2 are equal by nature, but we have to distinguish them somehow. So let's follow their names and say that type1 is 1 and type2 is 2. And both_types has index 3 as the sum of the previous two codes. Now we can separately identify whether a record can be of type 1 or of type 2, and take their sum as the final output. If a record is neither a type 1 nor a type 2 kind of error, the sum will be 0. If there's only one of them, the sum will keep it as is. If a record can be of both types, the sum will be 3.
Let's see how it looks in code:
err = custom_df['err_type'] == 'UNKNOWN'
pli = custom_df['pli_missing'] == 'yes'
grouper = custom_df['ir_id']
# with transform get the same dimension as custom_df
type1 = (err & pli).groupby(grouper).transform(any)
type2 = (err & ~pli).groupby(grouper).transform(any)
codes = type1 + 2*type2
categories = ['type3','type1','type2','both_types']
custom_df['err_flag_new'] = pd.Categorical.from_codes(codes, categories)
Here's what I've got in the end:
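The resulting frame isn't reproduced here, so below is a self-contained run of the same approach on the question's data (restricted to the columns the logic actually uses), showing the flag assigned to each group. Note that this answer spells the combined label both_types, while the question uses both_type:

```python
import numpy as np
import pandas as pd

custom_df = pd.DataFrame({
    'ir_id': ['name1'] * 4 + ['name2'] * 3 + ['name3'] * 4
             + ['name4'] * 3 + ['name5'] * 4,
    'pli_missing': ['no', 'no', 'no', 'yes', 'no', 'no', 'yes', 'no', 'no',
                    'no', 'no', 'no', 'no', 'no', 'yes', 'yes', 'yes', 'yes'],
    'err_type': ['UNKNOWN', np.nan, np.nan, 'UNKNOWN', np.nan, np.nan,
                 'UNKNOWN', np.nan, np.nan, np.nan, 'UNKNOWN', np.nan,
                 np.nan, np.nan, 'UNKNOWN', 'UNKNOWN', 'UNKNOWN', 'UNKNOWN'],
})

err = custom_df['err_type'] == 'UNKNOWN'
pli = custom_df['pli_missing'] == 'yes'
grouper = custom_df['ir_id']
# transform('any') broadcasts each group's True/False back to every row
type1 = (err & pli).groupby(grouper).transform('any')
type2 = (err & ~pli).groupby(grouper).transform('any')
codes = type1 + 2 * type2
categories = ['type3', 'type1', 'type2', 'both_types']
custom_df['err_flag_new'] = pd.Categorical.from_codes(codes, categories)

# one row per group: name1 -> both_types, name2 -> type1, name3 -> type2,
# name4 -> type3, name5 -> type1
print(custom_df.drop_duplicates('ir_id')[['ir_id', 'err_flag_new']])
```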
Update
We can look at it this way. Suppose there's a system with some number of independent states. The system can be described as having any combination of them. Mathematically, this can be described as a binary code. For each independent state we assign a unique place in this code, where the values 1 or 0 are interpreted as a logical answer to whether the system has the corresponding state. Binary means a sum like this one:

state[0]*2^0 + state[1]*2^1 + state[2]*2^2 + state[3]*2^3 + ...

In our case we have only 2 independent states, type1 and type2. The two others are their combinations: type3 means neither of them, and both_types speaks for itself. So we have only the first two terms of the sum above, where state[0] is the logical value for type1 and state[1] is the logical value for type2. That's why I used codes = type1 + 2*type2, which is equal to codes = type1 * 2**0 + type2 * 2**1.
As for the order in ['type3','type1','type2','both_types']: in this list the index of each value matches its code, i.e. each type has the index equal to its binary code. The binary code for type3 in this model is 0b00, which is zero; for type1 it is 0b01, which is one; for type2 it is 0b10, which is 2; and for both_types it's 0b11, which is equal to 3. These codes are automatically assigned when creating a Categorical sequence from codes, i.e. pandas uses the codes as indexes to look up the corresponding values in the list and puts them in place of the codes.

See also enum.IntFlag as an abstract realization of this idea, and the flags in the regular expression module as an example of how it can be used.
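For illustration, here is the same two-bit encoding sketched with enum.IntFlag (the class and member names are made up for this example):

```python
from enum import IntFlag

class ErrFlag(IntFlag):
    TYPE3 = 0  # neither independent state set: 0b00
    TYPE1 = 1  # 2**0, i.e. 0b01
    TYPE2 = 2  # 2**1, i.e. 0b10

# combining the two independent states yields the 'both_types' code
both = ErrFlag.TYPE1 | ErrFlag.TYPE2
print(int(both))  # 3, i.e. 0b11
```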
Upvotes: 1
Reputation: 14159
I've found a solution. Someone else may well find a better one, but this seems to arrive at the requested result correctly.
import pandas as pd
import numpy as np
df = custom_df.copy()
condlist = [((df['err_type'] == 'UNKNOWN') & (df['pli_missing'] == 'yes')),
((df['err_type'] == 'UNKNOWN') & (df['pli_missing'] == 'no'))]
choicelist = ['type1','type2']
df['err_flag'] = np.select(condlist, choicelist, default='type3')
s = df[df['err_flag'] != 'type3'].groupby('ir_id')['err_flag'].nunique().gt(1)
df.loc[df['ir_id'].isin(s[s].index),'err_flag'] = 'both_type'
df['err_flag'] = df.sort_values(['ir_id','err_flag']).groupby(['ir_id'])\
['err_flag'].transform('first')
print(df[['ir_id','err_flag']].drop_duplicates())
ir_id err_flag
0 name1 both_type
4 name2 type1
7 name3 type2
11 name4 type3
14 name5 type1
print(df.equals(custom_df))
# True
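The groupby/nunique step is the heart of the both-type detection; here is a minimal, self-contained sketch of just that part, using a two-group subset made up for the illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ir_id': ['name1', 'name1', 'name2', 'name2'],
    'pli_missing': ['no', 'yes', 'no', 'yes'],
    'err_type': ['UNKNOWN', 'UNKNOWN', np.nan, 'UNKNOWN'],
})
condlist = [
    (df['err_type'] == 'UNKNOWN') & (df['pli_missing'] == 'yes'),
    (df['err_type'] == 'UNKNOWN') & (df['pli_missing'] == 'no'),
]
df['err_flag'] = np.select(condlist, ['type1', 'type2'], default='type3')

# count distinct non-type3 flags per group; >1 means both type1 and type2
s = df[df['err_flag'] != 'type3'].groupby('ir_id')['err_flag'].nunique().gt(1)
print(s)  # name1 -> True (has both flags), name2 -> False (type1 only)
```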
Explanation steps:

1. np.select: in the condlist we store the conditions for type1 and type2, we let type3 be the default (i.e. neither condition is met), and assign the result to a new column, err_flag.
2. We then keep from df only the rows that do not contain type3 in the new column, and use df.groupby and then .nunique to get the count of unique values for each group (minus a potential type3, of course).
3. That count can either be 2 (both type1 and type2) or just 1. So, we filter using .gt(1) to get a boolean Series that is True only for the groups that have both. These are the ones we want to overwrite with both_type.
4. We filter df again based on the aforementioned Series and assign both_type.
5. Finally, we sort df on ['ir_id','err_flag'] and group by ir_id. Now we can ask for the first value from each group and assign it to the entire group (hence the use of .transform), thus overwriting any type3 values that are left (i.e. both type1 and type2 will come first, because of the sort).

Upvotes: 0