pandas duplicated records find same and different columns

Question

I have a DataFrame like

import pandas as pd

raw_data = {
        'subject_id': ['1', '2', '2', '3', '3'],
        'name': ['A', 'B', 'B', 'C', 'D'],
        'age_group': [1, 2, 2, 1, 1]}
df = pd.DataFrame(raw_data, columns=['subject_id', 'name', 'age_group'])

which contains a (possibly duplicated) ID and some additional columns. The snippet below

ids = df.subject_id
df[ids.isin(ids[ids.duplicated()])]

returns only the duplicated records. Now I would like to better understand which columns are

the same
different

for each duplicated ID. In this case, I would like to get the offending duplicate IDs together with the columns in which their rows differ:

  subject_id name
1          2    B
2          2    B
3          3    C
4          3    D

EFT · Accepted Answer

If you have

>>> duplicated_ids
  subject_id name  age_group
1          2    B          2
2          2    B          2
3          3    C          1
4          3    D          1
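
A minimal sketch of one way to build this frame from the question's `df`, assuming the goal is to keep every row whose `subject_id` occurs more than once. It uses `duplicated(keep=False)` instead of the question's `isin` expression; the name `duplicated_ids` is taken from this answer.

```python
import pandas as pd

raw_data = {
    'subject_id': ['1', '2', '2', '3', '3'],
    'name': ['A', 'B', 'B', 'C', 'D'],
    'age_group': [1, 2, 2, 1, 1],
}
df = pd.DataFrame(raw_data, columns=['subject_id', 'name', 'age_group'])

# keep=False marks every row whose subject_id occurs more than once,
# not only the second and later occurrences.
duplicated_ids = df[df['subject_id'].duplicated(keep=False)]
```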

Then

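>>> othercols = duplicated_ids.columns[1:]
>>> outcols = ['subject_id']
>>> for col in othercols:
...     # keep=False drops every row whose (subject_id, col) pair repeats;
...     # any row left over means col differs within some subject_id
...     if not duplicated_ids.drop_duplicates(['subject_id', col], keep=False).empty:
...         outcols.append(col)
...
>>> duplicated_ids.loc[:, outcols]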
  subject_id name
1          2    B
2          2    B
3          3    C
4          3    D
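
One caveat with the `drop_duplicates(..., keep=False)` check: if every differing value within an ID is itself repeated (say values A, A, B, B under one `subject_id`), all rows are dropped and the column is not reported. A possible alternative, sketched here as an assumption rather than as part of the approach above, counts distinct values per ID with `groupby(...).nunique()`. The names `value_cols`, `distinct_counts`, `differs`, `differing_cols` and `same_cols` are illustrative only.

```python
import pandas as pd

# Same frame as in the answer: only rows whose subject_id is duplicated.
duplicated_ids = pd.DataFrame(
    {
        'subject_id': ['2', '2', '3', '3'],
        'name': ['B', 'B', 'C', 'D'],
        'age_group': [2, 2, 1, 1],
    },
    index=[1, 2, 3, 4],
)

# Every column except the ID itself.
value_cols = [c for c in duplicated_ids.columns if c != 'subject_id']

# Number of distinct values per column within each subject_id group
# (NaN is not counted by default).
distinct_counts = duplicated_ids.groupby('subject_id')[value_cols].nunique()

# A column differs if at least one ID holds more than one distinct value in it.
differs = (distinct_counts > 1).any()
differing_cols = differs[differs].index.tolist()
same_cols = differs[~differs].index.tolist()

result = duplicated_ids.loc[:, ['subject_id'] + differing_cols]
```

Here `differing_cols` holds the columns that vary within at least one duplicated ID and `same_cols` holds the ones that agree everywhere, which covers both halves of the question.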
