How to iterate a row and compare with each other?

Question

For example, if the name 'Mark' is repeated and if the 'Age' value or 'Gender' value is different when compared to other rows which contain 'Mark', then such columns must be removed.

Code for creating dataframe:

df = pd.DataFrame({'Name' : ['Mark', 'Mark', 'Mark', 'Mark', 'Mark', 'Nick', 'Nick', 'John', 'Sunny', 'Sunny'], 
                  'Age' : ['22', '22', '25', '25', '17', '20', '20', '17', '23', '23'],
                  'Gender' : ['F', 'F', 'F', 'F', 'F', 'F', 'F', 'M', 'M', 'M']})

The dataframe is below:

       Name   Age   Gender
0      Mark   22    F
1      Mark   22    F
2      Mark   25    F
3      Mark   25    F
4      Mark   17    F
5      Nick   20    F
6      Nick   20    F
7      John   17    M
8      Sunny  23    M
9      Sunny  23    M

The expected output is:

       Name   Age   Gender
0      Nick   20    F
1      Nick   20    F
2      John   17    M
3      Sunny  23    M
4      Sunny  23    M

For example in the 1st and 2nd rows Name is Mark and Age is 22 and Gender is F. Whereas in 3rd-row Name is Mark, Age is 25 and Gender is F, we can see Age has multiple values, then we have to remove rows which contains Mark

Joshua Voskamp · Accepted Answer

A one-liner answer:

df[df.Name.isin((s:=df.drop_duplicates().groupby(['Name']).size().eq(1)).index[s])]

results in:

    Name Age Gender
5   Nick  20      F
6   Nick  20      F
7   John  17      M
8  Sunny  23      M
9  Sunny  23      M

Perhaps worth noting: the walrus-operator does still allocate memory for a variable s that Python's garbage-collector will eventually de-allocate. If you would need a Series representing whether any Name has different rows, as in below, then it'd be worth explicitly assigning it.

>>> s
Name
John      True
Mark     False
Nick      True
Sunny     True
dtype: bool

anywhere else in your calculations, it's likely worth explicitly assigning it to a variable to which you maintain a reference.

Explanation:

# from the OG df
df[
   # boolean mask if the name is in
   df.Name.isin(
                # walrus-operator to temporarily hold result
                # drop the duplicate rows (duplicates name + age + gender)
                (s:=df.drop_duplicates()
                       # group on Name
                       .groupby(['Name'])
                       # after dropping duplicates, there's only one record
                       # i.e. no different age/gender records for same name
                       .size().eq(1))
                # mask on names where no-duplicates is true, drop names where false
                ).index[s]
               # pass as series to `df.Name.isin`
               )
  ]

How to iterate a row and compare with each other?

Answers (2)

Related Questions