Reputation: 688
For example, if the name 'Mark' is repeated and if the 'Age' value or 'Gender' value is different when compared to other rows which contain 'Mark', then such columns must be removed.
Code for creating dataframe:
df = pd.DataFrame({'Name' : ['Mark', 'Mark', 'Mark', 'Mark', 'Mark', 'Nick', 'Nick', 'John', 'Sunny', 'Sunny'],
'Age' : ['22', '22', '25', '25', '17', '20', '20', '17', '23', '23'],
'Gender' : ['F', 'F', 'F', 'F', 'F', 'F', 'F', 'M', 'M', 'M']})
The dataframe is below:
Name Age Gender
0 Mark 22 F
1 Mark 22 F
2 Mark 25 F
3 Mark 25 F
4 Mark 17 F
5 Nick 20 F
6 Nick 20 F
7 John 17 M
8 Sunny 23 M
9 Sunny 23 M
The expected output is:
Name Age Gender
0 Nick 20 F
1 Nick 20 F
2 John 17 M
3 Sunny 23 M
4 Sunny 23 M
For example in the 1st and 2nd rows Name is Mark and Age is 22 and Gender is F. Whereas in 3rd-row Name is Mark, Age is 25 and Gender is F, we can see Age has multiple values, then we have to remove rows which contains Mark
Upvotes: 2
Views: 538
Reputation: 2054
A one-liner answer:
df[df.Name.isin((s:=df.drop_duplicates().groupby(['Name']).size().eq(1)).index[s])]
results in:
Name Age Gender
5 Nick 20 F
6 Nick 20 F
7 John 17 M
8 Sunny 23 M
9 Sunny 23 M
Perhaps worth noting: the walrus-operator does still allocate memory for a variable s
that Python's garbage-collector will eventually de-allocate. If you would need a Series
representing whether any Name
has different rows, as in below, then it'd be worth explicitly assigning it.
>>> s
Name
John True
Mark False
Nick True
Sunny True
dtype: bool
anywhere else in your calculations, it's likely worth explicitly assigning it to a variable to which you maintain a reference.
Explanation:
# from the OG df
df[
# boolean mask if the name is in
df.Name.isin(
# walrus-operator to temporarily hold result
# drop the duplicate rows (duplicates name + age + gender)
(s:=df.drop_duplicates()
# group on Name
.groupby(['Name'])
# after dropping duplicates, there's only one record
# i.e. no different age/gender records for same name
.size().eq(1))
# mask on names where no-duplicates is true, drop names where false
).index[s]
# pass as series to `df.Name.isin`
)
]
Upvotes: 1
Reputation: 54698
OK, this is tricky, so I'll explain each line.
import pandas as pd
df = pd.DataFrame({'Name' : ['Mark', 'Mark', 'Mark', 'Mark', 'Mark', 'Nick', 'Nick', 'John', 'Sunny', 'Sunny'],
'Age' : ['22', '22', '25', '25', '17', '20', '20', '17', '23', '23'],
'Gender' : ['F', 'F', 'F', 'F', 'F', 'F', 'F', 'M', 'M', 'M']})
print(df)
# First, drop all the rows that are exact duplicates of each other.
df1 = df.drop_duplicates()
print(df1)
# If any rows that remain have a duplicate name, those need to go.
df2 = df1.duplicated('Name')
print(df2)
# Go build a DF with just the names that were duplicated.
df3 = df1[df2]['Name']
print(df3)
# Find all the rows in the original df that have a name in this list,
# invert that set, and the result is what we want.
df4 = df[~df.Name.isin(df3)]
print(df4)
Several of those operations can be combined to make a one-liner, but I think this makes it more clear. I've added blank lines to separate the dataframes in the output.
Output:
Name Age Gender
0 Mark 22 F
1 Mark 22 F
2 Mark 25 F
3 Mark 25 F
4 Mark 17 F
5 Nick 20 F
6 Nick 20 F
7 John 17 M
8 Sunny 23 M
9 Sunny 23 M
Name Age Gender
0 Mark 22 F
2 Mark 25 F
4 Mark 17 F
5 Nick 20 F
7 John 17 M
8 Sunny 23 M
0 False
2 True
4 True
5 False
7 False
8 False
dtype: bool
2 Mark
4 Mark
Name: Name, dtype: object
Name Age Gender
5 Nick 20 F
6 Nick 20 F
7 John 17 M
8 Sunny 23 M
9 Sunny 23 M
Upvotes: 2