Reputation: 139
I'm trying to use pandas to identify sub-sections of a dataframe which are identical. So, for example, if I have a dataframe like:
id A B
0 1 1 2
1 1 2 3
2 1 5 6
3 2 1 2
4 2 2 3
5 2 5 6
6 3 8 9
7 3 4 0
8 3 9 7
I want to group by ID, so Rows 0 - 2 would form Group 1, Rows 3 - 5 would form Group 2, and Rows 6 - 8 would form Group 3. I know I can use pd.groupby()
to group rows by ID. In the case here, Group 2 is a repetition of Group 1 (Columns A and B are identical in both)
What I then want to do is to remove repeated groups, so in this case I would want to remove the second group. My final dataframe would then look like:
id A B
0 1 1 2
1 1 2 3
2 1 5 6
6 3 8 9
7 3 4 0
8 3 9 7
Every column in the duplicate groups is the same, except for the ID which is different for each group. I only want to remove a group if it is identical for every row in the group. Any help would be much appreciated!
Upvotes: 1
Views: 749
Reputation: 323326
Check pd.crosstab
s=pd.crosstab(df.id,[df.A,df.B]).drop_duplicates().unstack()
s[s!=0].reset_index().drop(0,1)
Out[128]:
A B id
0 1 2 1
1 2 3 1
2 4 0 3
3 5 6 1
4 8 9 3
5 9 7 3
Upvotes: 1
Reputation: 164773
This is one way using a helper column and pd.Series.drop_duplicates
.
The idea is to first create a mapping from id
to a tuple of values representing all rows for that id
. Then drop duplicates and extract the index of the remainder.
df['C'] = list(zip(df['A'], df['B']))
s = df.groupby('id')['C'].apply(tuple)\
.drop_duplicates().index
res = df.loc[df['id'].isin(s), ['id', 'A', 'B']]
print(res)
id A B
0 1 1 2
1 1 2 3
2 1 5 6
6 3 8 9
7 3 4 0
8 3 9 7
Upvotes: 1