Georg Heiler
Georg Heiler

Reputation: 17704

pandas duplicated records find same and different columns

I have a record like

raw_data = {
        'subject_id': ['1', '2', '2', '3', '3'],
        'name': ['A', 'B', 'B', 'C', 'D'],
        'age_group' : [1, 2, 2, 1, 1]}
df = pd.DataFrame(raw_data, columns = ['subject_id', 'name','age_group'])

which contains a (duplicated) ID and some additional columns. Below

ids = df.subject_id
df[ids.isin(ids[ids.duplicated()])]

will return already only the duplicated records. Now I would want to better understand the columns which are

for each duplicated record, i.e. in this case here I would want to receive the offending duplicate IDs and the respective columns where the rows are different.

  subject_id name
1          2    B
2          2    B
3          3    C
4          3    D

Upvotes: 2

Views: 1362

Answers (2)

spies006
spies006

Reputation: 2927

Here's one way to do it, which allows for exploring and better understanding where differences occur.

We will set up a column that indicates we have found a row where a given subject_id does not match what we have previously recorded for their name or their age_group, i.e. we've found a difference. This column will be called diff_indicator and is binary 0 if not different from previously recorded values, and 1 if it is different.

storage_dict = {}


for i in range(len(df)):
    if not df.loc[i, 'subject_id'] in storage_dict:
        storage_dict[df.loc[i, 'subject_id']] = (df.loc[i, 'name'], df.loc[i, 'age_group'])
for i in range(len(df)):
    if (df.loc[i, 'name'] != storage_dict[df.loc[i, 'subject_id']][0]) | \
    (df.loc[i, 'age_group'] != storage_dict[df.loc[i, 'subject_id']][1]):
        df.loc[i, 'diff_indicator'] = 1
    else:
        df.loc[i, 'diff_indicator'] = 0

From here we can explore where there exists a difference.

>>> df.loc[df['diff_indicator'] == 1]['subject_id'].unique()
array(['2', '3'], dtype=object)

>>> df.loc[df['diff_indicator'] == 1]
  subject_id name  age_group  diff_indicator
2          2    B          1             1.0
4          3    D          1             1.0

There are many ways to query this depending on your interest. We can return the desired output like this...

>>> df.loc[df['subject_id'].isin(df.loc[df['diff_indicator'] == 1]['subject_id'].unique())]
  subject_id name  age_group  diff_indicator
1          2    B          2             0.0
2          2    B          1             1.0
3          3    C          3             0.0
4          3    D          1             1.0

We can see the first two columns of this data frame are the desired output.

Upvotes: 1

EFT
EFT

Reputation: 2369

If you have

>>>duplicated_ids
  subject_id name  age_group
1          2    B          2
2          2    B          2
3          3    C          1
4          3    D          1

Then

>>>othercols = duplicated_ids.columns[1:]
>>>outcols = ['subject_id']
>>>for col in othercols:
       if not duplicated_ids.drop_duplicates(['subject_id', col], keep=False).empty:
           outcols.append(col)

>>>duplicated_ids.loc[:, outcols]
  subject_id name
1          2    B
2          2    B
3          3    C
4          3    D

Upvotes: 3

Related Questions