DataScience99
DataScience99

Reputation: 369

How to find records with same value in one column but different value in another column

I have two pandas df with the exact same column names. One of these columns is named id_number which is unique to each table (What I mean is an id_number can only appear once in each df). I want to find all records that have the same id_number but have at least one different value in any column and store these records in a new pandas df.

I've tried merging (more specifically inner join), but it keeps only one record with that specific id_number so I can't look for any differences between the two dfs.

Let me provide some example to provide a clearer explanation:

Example dfs:

First DF:

id_number    name    type    city
    1        John    dev     Toronto
    2        Alex    dev     Toronto
    3        Tyler   dev     Toronto
    4        David   dev     Toronto
    5        Chloe   dev     Toronto

Second DF:

id_number    name    type    city
    1        John    boss    Vancouver
    2        Alex    dev     Vancouver
    4        David   boss    Toronto
    5        Chloe   dev     Toronto
    6        Kyle    dev     Vancouver

I want the resulting df to contain the following records:

id_number    name    type    city
    1        John    dev     Toronto
    1        John    boss    Vancouver
    2        Alex    dev     Toronto
    2        Alex    dev     Vancouver
    4        David   dev     Toronto
    4        David   Boss    Toronto

NOTE: I would not want records with id_number 5 to appear in the resulting df, that is because the records with id_number 5 are exactly the same in both dfs.

In reality, there are 80 columns for each record, but I think these tables make my point a little clearer. Again to summarize, I want the resulting df to contain records with same id_numbers, but a different value in any of the other columns. Thanks in advance for any help!

Upvotes: 3

Views: 557

Answers (2)

Scott Boston
Scott Boston

Reputation: 153470

Here is, a way using pd.concat, drop_duplicates and duplicated:

pd.concat([df1, df2]).drop_duplicates(keep=False).sort_values('id_number')\
  .loc[lambda x: x.id_number.duplicated(keep=False)]

Output:

   id_number   name  type       city
0          1   John   dev    Toronto
0          1   John  boss  Vancouver
1          2   Alex   dev    Toronto
1          2   Alex   dev  Vancouver
3          4  David   dev    Toronto
2          4  David  boss    Toronto

Upvotes: 1

BENY
BENY

Reputation: 323306

Here is one way using nunique then we pick those id_number more than 1 and slice them out

s = pd.concat([df1, df2])
s = s.loc[s.id_number.isin(s.groupby(['id_number']).nunique().gt(1).any(1).loc[lambda x : x].index)]
s
Out[654]: 
   id_number   name  type       city
0          1   John   dev    Toronto
1          2   Alex   dev    Toronto
3          4  David   dev    Toronto
0          1   John  boss  Vancouver
1          2   Alex   dev  Vancouver
2          4  David  boss    Toronto

Upvotes: 2

Related Questions