Reputation: 17704
I have a record like
raw_data = {
'subject_id': ['1', '2', '2', '3', '3'],
'name': ['A', 'B', 'B', 'C', 'D'],
'age_group' : [1, 2, 2, 1, 1]}
df = pd.DataFrame(raw_data, columns = ['subject_id', 'name','age_group'])
which contains a (duplicated) ID and some additional columns. Below
ids = df.subject_id
df[ids.isin(ids[ids.duplicated()])]
will return already only the duplicated records. Now I would want to better understand the columns which are
for each duplicated record, i.e. in this case here I would want to receive the offending duplicate IDs and the respective columns where the rows are different.
subject_id name
1 2 B
2 2 B
3 3 C
4 3 D
Upvotes: 2
Views: 1362
Reputation: 2927
Here's one way to do it, which allows for exploring and better understanding where differences occur.
We will set up a column that indicates we have found a row where a given subject_id
does not match what we have previously recorded for their name
or their age_group
, i.e. we've found a difference. This column will be called diff_indicator
and is binary 0
if not different from previously recorded values, and 1
if it is different.
storage_dict = {}
for i in range(len(df)):
if not df.loc[i, 'subject_id'] in storage_dict:
storage_dict[df.loc[i, 'subject_id']] = (df.loc[i, 'name'], df.loc[i, 'age_group'])
for i in range(len(df)):
if (df.loc[i, 'name'] != storage_dict[df.loc[i, 'subject_id']][0]) | \
(df.loc[i, 'age_group'] != storage_dict[df.loc[i, 'subject_id']][1]):
df.loc[i, 'diff_indicator'] = 1
else:
df.loc[i, 'diff_indicator'] = 0
From here we can explore where there exists a difference.
>>> df.loc[df['diff_indicator'] == 1]['subject_id'].unique()
array(['2', '3'], dtype=object)
>>> df.loc[df['diff_indicator'] == 1]
subject_id name age_group diff_indicator
2 2 B 1 1.0
4 3 D 1 1.0
There are many ways to query this depending on your interest. We can return the desired output like this...
>>> df.loc[df['subject_id'].isin(df.loc[df['diff_indicator'] == 1]['subject_id'].unique())]
subject_id name age_group diff_indicator
1 2 B 2 0.0
2 2 B 1 1.0
3 3 C 3 0.0
4 3 D 1 1.0
We can see the first two columns of this data frame are the desired output.
Upvotes: 1
Reputation: 2369
If you have
>>>duplicated_ids
subject_id name age_group
1 2 B 2
2 2 B 2
3 3 C 1
4 3 D 1
Then
>>>othercols = duplicated_ids.columns[1:]
>>>outcols = ['subject_id']
>>>for col in othercols:
if not duplicated_ids.drop_duplicates(['subject_id', col], keep=False).empty:
outcols.append(col)
>>>duplicated_ids.loc[:, outcols]
subject_id name
1 2 B
2 2 B
3 3 C
4 3 D
Upvotes: 3