Find the number of unique items per group

Question

I have a df A that looks like,

type    number    date 
1       1         2017-10-01
2       1         2017-10-31
1       2         2017-09-01
1       2         2017-08-01
2       2         2017-06-01

First I want to groupby number and select groups that each group need to at least contain 1 row with type == 1 and one row with type == 2; in order to do that I first check that the sizes of each group must be > 2,

g = A.groupby('number')
B = g.filter(lambda x: len(x) > 1) # df B gets the filter results on g

but I don't know how to also check type values 1 and 2 are contained in every group acquired above.

Second, based on the groups acquired from step 1, I want to create a Boolean column in A called type_2_before_type_1, it is assigned True if earliest date for type 1 substracts earliest date for type 2 is < 0 for each group, and False otherwise. Since there could be many rows with type == 1 and many rows with type == 2 in each group (how to get g['A'].transform('min') for type 1 and type 2 rows?). So the result df A looks like,

type    number    date         type_2_before_type_1
1       1         2017-10-01   True
2       1         2017-10-31   True
1       2         2017-09-01   False
1       2         2017-08-01   False
2       2         2017-06-01   False

e.g. for the group with number == 2, earliest date for type 1 2017-08-01 subtracts the earliest date for type 2 2017-06-01, result in 61 days; hence set False to type_2_before_type_1.

cs95 · Accepted Answer

For the first part, use groupby + nunique -

df.groupby('number').type.transform('nunique')

0    2
1    2
2    2
3    2
4    2
Name: type, dtype: int64

df = df[df.groupby('number').type.transform('nunique').gt(1)]

   type  number        date
0     1       1  2017-10-01
1     2       1  2017-10-31
2     1       2  2017-09-01
3     1       2  2017-08-01
4     2       2  2017-06-01

The second part is a bit challenging, but you can groupby, unstack, call diff and then get a mask -

df['type_2_before_type_1'] = \
 df.groupby(['number', 'type'])\
      .first()\
      .unstack()\
      .diff(-1, axis=1)\
      .iloc[:, 0]\
      .reindex(df.number)\
      .astype(int)\
      .lt(0)\
      .values

df

   type  number       date  type_2_before_type_1
0     1       1 2017-10-01                  True
1     2       1 2017-10-31                  True
2     1       2 2017-09-01                 False
3     1       2 2017-08-01                 False
4     2       2 2017-06-01                 False

Find the number of unique items per group

Answers (2)

Related Questions