FClubb

Reputation: 139

How to identify identical groups using pandas.groupby()?

I'm trying to use pandas to identify sub-sections of a dataframe which are identical. So, for example, if I have a dataframe like:

   id  A  B 
0   1  1  2
1   1  2  3
2   1  5  6
3   2  1  2
4   2  2  3
5   2  5  6
6   3  8  9
7   3  4  0
8   3  9  7

I want to group by id, so rows 0-2 would form Group 1, rows 3-5 would form Group 2, and rows 6-8 would form Group 3. I know I can use df.groupby('id') to group rows by id. In the case here, Group 2 is a repetition of Group 1 (columns A and B are identical in both).

What I then want to do is to remove repeated groups, so in this case I would want to remove the second group. My final dataframe would then look like:

   id  A  B 
0   1  1  2
1   1  2  3
2   1  5  6
6   3  8  9
7   3  4  0
8   3  9  7

Every column in the duplicate groups is the same, except for the ID which is different for each group. I only want to remove a group if it is identical for every row in the group. Any help would be much appreciated!
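For anyone who wants to run the answers, the example frame above can be rebuilt roughly as follows. This is only a sketch; the name `groups` is introduced here for illustration and is not part of the original question.

```python
import pandas as pd

# Example data copied from the table in the question
df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "A":  [1, 2, 5, 1, 2, 5, 8, 4, 9],
    "B":  [2, 3, 6, 2, 3, 6, 9, 0, 7],
})

# Map each id to the row labels that belong to it
# (hypothetical helper dict, only used to inspect the grouping)
groups = {key: list(idx) for key, idx in df.groupby("id").groups.items()}
print(groups)
```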

Upvotes: 1

Views: 749

Answers (2)

BENY

Reputation: 323326

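You can use pd.crosstab. It counts how often each (A, B) pair occurs per id, so ids with identical rows produce identical count rows, which drop_duplicates then removes. Note that this compares groups regardless of row order, and the result is sorted by A and B rather than keeping the original index.

s = pd.crosstab(df.id, [df.A, df.B]).drop_duplicates().unstack()
# Keep the (A, B, id) combinations that actually occur, then drop the count column (named 0)
s[s != 0].reset_index().drop(columns=0)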
Out[128]: 
   A  B  id
0  1  2   1
1  2  3   1
2  4  0   3
3  5  6   1
4  8  9   3
5  9  7   3
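If the original row order and index should be preserved, one possible variation on the crosstab idea is to use it only to decide which ids to keep and then filter the original frame. This is a sketch, not part of the answer above; `counts` and `kept_ids` are names chosen here for illustration, and like the crosstab approach itself it treats groups as equal when they contain the same (A, B) pairs in any order.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "A":  [1, 2, 5, 1, 2, 5, 8, 4, 9],
    "B":  [2, 3, 6, 2, 3, 6, 9, 0, 7],
})

# One row per id, one column per distinct (A, B) pair, cells hold counts
counts = pd.crosstab(df["id"], [df["A"], df["B"]])

# drop_duplicates keeps the first id for each distinct count pattern
kept_ids = counts.drop_duplicates().index

# Filter the original frame so row order and index labels are untouched
res = df[df["id"].isin(kept_ids)]
print(res)
```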

Upvotes: 1

jpp

Reputation: 164773

This is one way using a helper column and pd.Series.drop_duplicates.

The idea is to first create a mapping from id to a tuple of values representing all rows for that id. Then drop duplicates and extract the index of the remainder.

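# Helper column: each row's (A, B) values as a tuple
df['C'] = list(zip(df['A'], df['B']))

# One tuple of row tuples per id; keep the first id for each distinct tuple
s = df.groupby('id')['C'].apply(tuple)\
      .drop_duplicates().index

# Select the rows of the surviving ids, without the helper column
res = df.loc[df['id'].isin(s), ['id', 'A', 'B']]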

print(res)

   id  A  B
0   1  1  2
1   1  2  3
2   1  5  6
6   3  8  9
7   3  4  0
8   3  9  7
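The same idea can be wrapped in a small function that avoids the helper column and works for any list of value columns. This is a minimal sketch under stated assumptions: the function name `drop_duplicate_groups` and its parameters are hypothetical, groups are compared in row order, and missing values are not handled specially.

```python
import pandas as pd

def drop_duplicate_groups(df, key, cols):
    """Keep only the first group among groups whose rows in `cols` match exactly.

    Hypothetical helper for illustration; not a pandas API.
    """
    seen = set()   # group signatures encountered so far
    keep = []      # keys of the groups to retain
    for k, g in df.groupby(key, sort=False):
        # Signature: the group's rows as a tuple of plain tuples, in row order
        sig = tuple(g[cols].itertuples(index=False, name=None))
        if sig not in seen:
            seen.add(sig)
            keep.append(k)
    return df[df[key].isin(keep)]

df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "A":  [1, 2, 5, 1, 2, 5, 8, 4, 9],
    "B":  [2, 3, 6, 2, 3, 6, 9, 0, 7],
})

res = drop_duplicate_groups(df, "id", ["A", "B"])
print(res)
```

Because the signature preserves row order, two groups with the same rows in a different order are treated as different here, which matches the tuple-based approach above rather than the crosstab one.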

Upvotes: 1
