Nithin Reddy
Nithin Reddy

Reputation: 688

How to iterate a row and compare with each other?

For example, if the name 'Mark' is repeated and if the 'Age' value or 'Gender' value is different when compared to other rows which contain 'Mark', then such columns must be removed.

Code for creating dataframe:

df = pd.DataFrame({'Name' : ['Mark', 'Mark', 'Mark', 'Mark', 'Mark', 'Nick', 'Nick', 'John', 'Sunny', 'Sunny'], 
                  'Age' : ['22', '22', '25', '25', '17', '20', '20', '17', '23', '23'],
                  'Gender' : ['F', 'F', 'F', 'F', 'F', 'F', 'F', 'M', 'M', 'M']})

The dataframe is below:

       Name   Age   Gender
0      Mark   22    F
1      Mark   22    F
2      Mark   25    F
3      Mark   25    F
4      Mark   17    F
5      Nick   20    F
6      Nick   20    F
7      John   17    M
8      Sunny  23    M
9      Sunny  23    M

The expected output is:

       Name   Age   Gender
0      Nick   20    F
1      Nick   20    F
2      John   17    M
3      Sunny  23    M
4      Sunny  23    M

For example in the 1st and 2nd rows Name is Mark and Age is 22 and Gender is F. Whereas in 3rd-row Name is Mark, Age is 25 and Gender is F, we can see Age has multiple values, then we have to remove rows which contains Mark

Upvotes: 2

Views: 538

Answers (2)

Joshua Voskamp
Joshua Voskamp

Reputation: 2054

A one-liner answer:

df[df.Name.isin((s:=df.drop_duplicates().groupby(['Name']).size().eq(1)).index[s])]

results in:

    Name Age Gender
5   Nick  20      F
6   Nick  20      F
7   John  17      M
8  Sunny  23      M
9  Sunny  23      M

Perhaps worth noting: the walrus-operator does still allocate memory for a variable s that Python's garbage-collector will eventually de-allocate. If you would need a Series representing whether any Name has different rows, as in below, then it'd be worth explicitly assigning it.

>>> s
Name
John      True
Mark     False
Nick      True
Sunny     True
dtype: bool

anywhere else in your calculations, it's likely worth explicitly assigning it to a variable to which you maintain a reference.


Explanation:

# from the OG df
df[
   # boolean mask if the name is in
   df.Name.isin(
                # walrus-operator to temporarily hold result
                # drop the duplicate rows (duplicates name + age + gender)
                (s:=df.drop_duplicates()
                       # group on Name
                       .groupby(['Name'])
                       # after dropping duplicates, there's only one record
                       # i.e. no different age/gender records for same name
                       .size().eq(1))
                # mask on names where no-duplicates is true, drop names where false
                ).index[s]
               # pass as series to `df.Name.isin`
               )
  ]

Upvotes: 1

Tim Roberts
Tim Roberts

Reputation: 54698

OK, this is tricky, so I'll explain each line.

import pandas as pd

df = pd.DataFrame({'Name' : ['Mark', 'Mark', 'Mark', 'Mark', 'Mark', 'Nick', 'Nick', 'John', 'Sunny', 'Sunny'], 
                  'Age' : ['22', '22', '25', '25', '17', '20', '20', '17', '23', '23'],
                  'Gender' : ['F', 'F', 'F', 'F', 'F', 'F', 'F', 'M', 'M', 'M']})

print(df)

# First, drop all the rows that are exact duplicates of each other.

df1 = df.drop_duplicates()
print(df1)

# If any rows that remain have a duplicate name, those need to go.

df2 = df1.duplicated('Name')
print(df2)

# Go build a DF with just the names that were duplicated.

df3 = df1[df2]['Name']
print(df3)

# Find all the rows in the original df that have a name in this list,
# invert that set, and the result is what we want.

df4 = df[~df.Name.isin(df3)]
print(df4)

Several of those operations can be combined to make a one-liner, but I think this makes it more clear. I've added blank lines to separate the dataframes in the output.

Output:

    Name Age Gender
0   Mark  22      F
1   Mark  22      F
2   Mark  25      F
3   Mark  25      F
4   Mark  17      F
5   Nick  20      F
6   Nick  20      F
7   John  17      M
8  Sunny  23      M
9  Sunny  23      M

    Name Age Gender
0   Mark  22      F
2   Mark  25      F
4   Mark  17      F
5   Nick  20      F
7   John  17      M
8  Sunny  23      M

0    False
2     True
4     True
5    False
7    False
8    False
dtype: bool

2    Mark
4    Mark
Name: Name, dtype: object

    Name Age Gender
5   Nick  20      F
6   Nick  20      F
7   John  17      M
8  Sunny  23      M
9  Sunny  23      M

Upvotes: 2

Related Questions