Reputation: 29
I need a function, or a sentence, to identify duplicated rows in a group.
For example, if we have this dataframe, grouping by column ID 1, I need see if values of ID 2 are duplicated within its group, or not:
ID 1 | ID 2 |
---|---|
A | 1 |
A | 2 |
A | 2 |
A | 2 |
B | 11 |
B | 12 |
the desired output should be:
ID 1 | ID 2 | Duplicated_ID 2_in_its_group(ID 1) |
---|---|---|
A | 1 | false |
A | 2 | true |
A | 2 | true |
A | 2 | true |
B | 11 | false |
B | 12 | false |
Ideally, define a function. If not, a sentence to apply to the whole dataframe.
Thanks.
I tried a function with no result. Sorry.
Upvotes: 0
Views: 62
Reputation: 1864
ID 1
ID 2
using duplicated
.True
and the non-duplicated as False
.Implementation:
import pandas as pd
def identify_duplicates(df):
df['Duplicated_ID_2_in_its_group'] = df.groupby('ID 1')['ID 2'].transform(lambda x: x.duplicated(keep=False))
return df
data = {
'ID 1': ['A', 'A', 'A', 'A', 'B', 'B'],
'ID 2': [1, 2, 2, 2, 11, 12]
}
df = pd.DataFrame(data)
result_df = identify_duplicates(df)
print(result_df)
Upvotes: 1
Reputation: 2669
Please use:
df.groupby('ID 1')['ID 2'].transform(lambda x: x.duplicated(keep=False))
By setting keep on False, all duplicates are True.
Reference: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.duplicated.html
Sample Code:
import pandas as pd
data = {'ID 1': ['A', 'A', 'A', 'A', 'B', 'B'],
'ID 2': [1, 2, 2, 2, 11, 12]}
df = pd.DataFrame(data)
df['Duplicated_ID_2_in_its_group(ID_1)'] = df.groupby('ID 1')['ID 2'].transform(lambda x: x.duplicated(keep=False))
print(df)
Output:
ID 1 ID 2 Duplicated_ID_2_in_its_group(ID_1)
0 A 1 False
1 A 2 True
2 A 2 True
3 A 2 True
4 B 11 False
5 B 12 False
Upvotes: 2