How to identify identical groups using pandas.groupby()?

Question

I'm trying to use pandas to identify sub-sections of a dataframe which are identical. So, for example, if I have a dataframe like:

   id  A  B 
0   1  1  2
1   1  2  3
2   1  5  6
3   2  1  2
4   2  2  3
5   2  5  6
6   3  8  9
7   3  4  0
8   3  9  7

I want to group by id, so rows 0-2 would form Group 1, rows 3-5 would form Group 2, and rows 6-8 would form Group 3. I know I can use df.groupby('id') to group rows by id. In the case here, Group 2 is a repetition of Group 1 (columns A and B are identical in both).
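For anyone who wants to reproduce this, here is a minimal sketch that builds the example frame and lists the A/B rows of each group. The `groups` dictionary is only an illustrative name, not part of any pandas API.

```python
import pandas as pd

# The example frame from the question
df = pd.DataFrame({
    'id': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'A':  [1, 2, 5, 1, 2, 5, 8, 4, 9],
    'B':  [2, 3, 6, 2, 3, 6, 9, 0, 7],
})

# Map each id to the list of its [A, B] rows, in row order
groups = {
    group_id: group[['A', 'B']].values.tolist()
    for group_id, group in df.groupby('id')
}

for group_id, rows in groups.items():
    print(group_id, rows)

# Groups 1 and 2 hold the same rows; group 3 differs
print(groups[1] == groups[2], groups[1] == groups[3])
```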

What I then want to do is to remove repeated groups, so in this case I would want to remove the second group. My final dataframe would then look like:

   id  A  B 
0   1  1  2
1   1  2  3
2   1  5  6
6   3  8  9
7   3  4  0
8   3  9  7

Every column in the duplicate groups is the same, except for the id, which is different for each group. I only want to remove a group if every one of its rows is identical to the corresponding row of another group. Any help would be much appreciated!

jpp · Accepted Answer

This is one way using a helper column and pd.Series.drop_duplicates.

The idea is to first create a mapping from id to a tuple of values representing all rows for that id. Then drop duplicates and extract the index of the remainder.

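# Helper column: one (A, B) tuple per row
df['C'] = list(zip(df['A'], df['B']))

# One tuple of row tuples per id; keep the first id for each distinct tuple
s = df.groupby('id')['C'].apply(tuple)\
      .drop_duplicates().index

# Select rows for the surviving ids, leaving out the helper column
res = df.loc[df['id'].isin(s), ['id', 'A', 'B']]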

print(res)

   id  A  B
0   1  1  2
1   1  2  3
2   1  5  6
6   3  8  9
7   3  4  0
8   3  9  7
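The same idea can also be sketched without adding a helper column to df. This is a variation under stated assumptions, not the code above: the function name `drop_repeated_groups` and its `key` parameter are made up for illustration, every column other than the key is compared, and two groups only count as identical if their rows appear in the same order.

```python
import pandas as pd

def drop_repeated_groups(df, key='id'):
    """Keep the first group of each set of groups whose non-key rows match."""
    value_cols = [c for c in df.columns if c != key]
    seen = set()      # signatures of groups already kept
    keep_ids = []     # ids of the groups to keep
    # sort=False keeps groups in order of first appearance
    for group_id, group in df.groupby(key, sort=False):
        # Hashable signature: a tuple of plain row tuples, in row order
        signature = tuple(group[value_cols].itertuples(index=False, name=None))
        if signature not in seen:
            seen.add(signature)
            keep_ids.append(group_id)
    # Boolean mask preserves the original index and column order
    return df[df[key].isin(keep_ids)]

df = pd.DataFrame({
    'id': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'A':  [1, 2, 5, 1, 2, 5, 8, 4, 9],
    'B':  [2, 3, 6, 2, 3, 6, 9, 0, 7],
})

res = drop_repeated_groups(df)
print(res)
```

Because the signature is built inside the loop, df itself is left unchanged, and the function works for any number of value columns.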
