DataScience99

Reputation: 369

Check if records with same value in one column are the same throughout

I have a pandas df with ~1 million records. The df has over 80 columns, one of which is asset_id. I want to create a subset of all the records that have duplicate asset_ids but a different value in at least one of the other columns.

Example:

df = pd.DataFrame({"asset_id": [1,1,1,2,2,3,4,5,5], "Name":["Canola", "Canola", "Canola", "Precision", "Precision", "Explore", "Testing", "Conda", "Conda Inc"], "Country":["CAN", "CAN", "USA", "CAN", "CAN", "USA", "CAN", "USA", "USA"]}) 
asset_id    Name    Country 
  1       Canola     CAN
  1       Canola     CAN
  1       Canola     USA
  2       Precision  CAN
  2       Precision  CAN
  3       Explore    USA
  4       Testing    CAN
  5       Conda      USA
  5       Conda Inc  USA

I would want the resulting table to look like this:

asset_id    Name    Country
  1       Canola     CAN
  1       Canola     USA
  5       Conda      USA
  5       Conda Inc  USA

Thanks in advance for all the help!

Upvotes: 1

Views: 1272

Answers (3)

Gravity Mass

Reputation: 605

Use drop_duplicates() to remove the fully identical rows, then keep the asset_ids that still appear more than once. It gets the job done.

import pandas as pd

df = pd.DataFrame(
    {
        "asset_id": [1, 1, 1, 2, 2, 3, 4, 5, 5],
        "Name": [
            "Canola",
            "Canola",
            "Canola",
            "Precision",
            "Precision",
            "Explore",
            "Testing",
            "Conda",
            "Conda Inc",
        ],
        "Country": ["CAN", "CAN", "USA", "CAN", "CAN", "USA", "CAN", "USA", "USA"],
    }
)


# Remove fully identical rows first
df = df.drop_duplicates()

# Count how many rows each asset_id still has
x = df["asset_id"].value_counts()

# Any asset_id that still appears more than once must differ in at least one other column
for asset, count in zip(x.index, x):
    if count > 1:
        y = df.loc[df["asset_id"] == asset]
        print(y.values)

Running the code above prints the groups you wanted:

[[5 'Conda' 'USA']
 [5 'Conda Inc' 'USA']]
[[1 'Canola' 'CAN']
 [1 'Canola' 'USA']]
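
If a single DataFrame is preferred over printing each group, the same idea can be written without the loop (a sketch reusing the df built above; dedup and result are just illustrative names):

# After dropping fully identical rows, any asset_id that still appears
# more than once must differ in at least one other column.
dedup = df.drop_duplicates()
result = dedup[dedup.duplicated("asset_id", keep=False)]
print(result)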

Upvotes: 1

Michael Silverstein

Reputation: 1843

You could directly filter for the groups that have more than one Name or more than one Country, and then drop any remaining exact duplicates with:

df.groupby('asset_id').filter(lambda x: (x.Name.nunique()>1) | (x.Country.nunique()>1)).drop_duplicates()

Output:

asset_id    Name    Country
0   1   Canola  CAN
2   1   Canola  USA
7   5   Conda   USA
8   5   Conda Inc   USA
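
Since the real frame has over 80 columns, hard-coding Name and Country may not scale. One possible generalization (a sketch, assuming df is the example frame from the question) checks every column other than asset_id with nunique:

# Keep groups where any column other than asset_id has more than one unique value,
# then drop the rows that are exact duplicates.
out = (
    df.groupby('asset_id')
      .filter(lambda g: g.drop(columns='asset_id').nunique().gt(1).any())
      .drop_duplicates()
)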

Upvotes: 3

Valentino

Reputation: 7361

You can create a custom function to do the selection and use it with groupby and apply.

def selecting(x):
    lencol = set(len(x[col].unique()) for col in x.columns)
    if len(lencol) == 1:
        return pd.DataFrame(columns=x.columns) #empty dataframe
    else:
        return x[~x.duplicated()]

ddf = df.groupby('asset_id').apply(selecting)

If you drop the MultiIndex created by groupby, you get:

ddf.reset_index(drop=True)

  asset_id       Name Country
0        1     Canola     CAN
1        1     Canola     USA
2        5      Conda     USA
3        5  Conda Inc     USA

Explanation

lencol is a set storing how many unique elements each column has. Being a set, columns with the same number of unique elements collapse into a single entry.
Since asset_id is constant within each group, len(lencol) == 1 (the set has one element) means every column has exactly one unique value, i.e. all rows in the group are identical, so an empty dataframe is returned. Otherwise, a dataframe without duplicate rows is returned. Check the duplicated method to understand how it works.
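
The same check can also be written with nunique, which may read a little more directly (a sketch, equivalent for this purpose to the set-based check above):

def selecting(x):
    # Every column has at most one unique value only when all rows
    # in the group are identical duplicates.
    if x.nunique().le(1).all():
        return pd.DataFrame(columns=x.columns)
    return x[~x.duplicated()]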

Upvotes: 1
