Reputation: 369
I have a pandas df with ~1 million records. The df has over 80 columns, one of which is asset_id. I want to create a subset of all the records which have duplicate asset_ids but a different value in at least one of the other columns.
Example:
df = pd.DataFrame({"asset_id": [1,1,1,2,2,3,4,5,5], "Name":["Canola", "Canola", "Canola", "Precision", "Precision", "Explore", "Testing", "Conda", "Conda Inc"], "Country":["CAN", "CAN", "USA", "CAN", "CAN", "USA", "CAN", "USA", "USA"]})
asset_id Name Country
1 Canola CAN
1 Canola CAN
1 Canola USA
2 Precision CAN
2 Precision CAN
3 Explore USA
4 Testing CAN
5 Conda USA
5 Conda Inc USA
I would want the resulting table to look like this:
asset_id Name Country
1 Canola CAN
1 Canola USA
5 Conda USA
5 Conda Inc USA
Thanks in advance for all the help!
Upvotes: 1
Views: 1272
Reputation: 605
Use drop_duplicates(). It gets the job done.
import pandas as pd
df = pd.DataFrame(
{
"asset_id": [1, 1, 1, 2, 2, 3, 4, 5, 5],
"Name": [
"Canola",
"Canola",
"Canola",
"Precision",
"Precision",
"Explore",
"Testing",
"Conda",
"Conda Inc",
],
"Country": ["CAN", "CAN", "USA", "CAN", "CAN", "USA", "CAN", "USA", "USA"],
}
)
# Remove fully identical rows first
df = df.drop_duplicates()
counts = df["asset_id"].value_counts()
for asset, count in counts.items():
    if count > 1:  # asset_id still repeated, so its rows differ in another column
        y = df.loc[df["asset_id"] == asset]
        print(y.values)
This prints the groups you wanted:
[[5 'Conda' 'USA']
[5 'Conda Inc' 'USA']]
[[1 'Canola' 'CAN']
[1 'Canola' 'USA']]
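If you want to avoid the Python loop, the same drop_duplicates idea can be expressed with duplicated(keep=False): once exact duplicates are gone, any asset_id that still repeats must differ somewhere else. A sketch on the example frame:

```python
import pandas as pd

df = pd.DataFrame({
    "asset_id": [1, 1, 1, 2, 2, 3, 4, 5, 5],
    "Name": ["Canola", "Canola", "Canola", "Precision", "Precision",
             "Explore", "Testing", "Conda", "Conda Inc"],
    "Country": ["CAN", "CAN", "USA", "CAN", "CAN", "USA", "CAN", "USA", "USA"],
})

# Drop fully identical rows; any asset_id that still appears more than once
# must differ in at least one other column.
dedup = df.drop_duplicates()
result = dedup[dedup.duplicated("asset_id", keep=False)]
print(result)
```

This returns a DataFrame subset directly instead of printing arrays, which is closer to what the question asked for.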
Upvotes: 1
Reputation: 1843
You could directly filter out the groups that have more than one Name or more than one Country, and then drop any remaining duplicates with:
df.groupby('asset_id').filter(lambda x: (x.Name.nunique()>1) | (x.Country.nunique()>1)).drop_duplicates()
Output:
asset_id Name Country
0 1 Canola CAN
2 1 Canola USA
7 5 Conda USA
8 5 Conda Inc USA
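With ~80 columns you probably don't want to spell every column name out in the lambda; a variation of the same filter that checks nunique across all non-key columns (column names here are just the ones from the example):

```python
import pandas as pd

df = pd.DataFrame({
    "asset_id": [1, 1, 1, 2, 2, 3, 4, 5, 5],
    "Name": ["Canola", "Canola", "Canola", "Precision", "Precision",
             "Explore", "Testing", "Conda", "Conda Inc"],
    "Country": ["CAN", "CAN", "USA", "CAN", "CAN", "USA", "CAN", "USA", "USA"],
})

# Every column except the key; keep a group if any of them varies.
other = df.columns.difference(["asset_id"])
out = (df.groupby("asset_id")
         .filter(lambda g: (g[other].nunique() > 1).any())
         .drop_duplicates())
print(out)
```

On the example data this produces the same four rows as the explicit Name/Country version.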
Upvotes: 3
Reputation: 7361
You can create a custom function to do the selection and use it with groupby
and apply
.
def selecting(x):
    # number of unique values in each column of the group, collected in a set
    lencol = set(len(x[col].unique()) for col in x.columns)
    if len(lencol) == 1:
        # every column holds a single value, so the group has no variation
        return pd.DataFrame(columns=x.columns)  # empty dataframe
    else:
        return x[~x.duplicated()]
ddf = df.groupby('asset_id').apply(selecting)
If you drop the indexes created by groupby, you get:
ddf.reset_index(drop=True)
asset_id Name Country
0 1 Canola CAN
1 1 Canola USA
2 5 Conda USA
3 5 Conda Inc USA
lencol is a set storing how many unique elements each column has within the group. Being a set, columns with the same number of unique values appear only once. Since asset_id is constant within a group, len(lencol) being 1 means every column holds a single value, so the group has no variation and an empty dataframe is returned. Otherwise, a dataframe without duplicate rows is returned. Check the duplicated method to understand how it works.
Upvotes: 1