cover51

Reputation: 107

Removing all rows of a duplicate based on value of multiple columns

I have a large dataframe with multiple columns and many rows (200k). I order the rows by a group variable, and each group can have one or more entries. The other columns within each group should have identical values, but in some cases they don't. It looks like this:

group   name    age    color
1       Anton   50     orange
1       Anton   21     red
1       Anton   21     red
2       Martin  78     black
2       Martin  78     blue
3       Maria   25     red
3       Maria   29     pink
4       Jake    33     blue

I want to delete all entries of a group if either age or color is not identical for all rows of the group (indicating an observation error). However, I want to keep duplicated entries if all columns have the same value. So the output I'm hoping for would be:

group   name    age    color   
2       Martin  78     black
2       Martin  78     blue  
4       Jake    33     blue
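
For reference, the example data can be reconstructed as a DataFrame like this (a minimal sketch; column dtypes are assumed):

import pandas as pd

df = pd.DataFrame({
    'group': [1, 1, 1, 2, 2, 3, 3, 4],
    'name':  ['Anton', 'Anton', 'Anton', 'Martin', 'Martin', 'Maria', 'Maria', 'Jake'],
    'age':   [50, 21, 21, 78, 78, 25, 29, 33],
    'color': ['orange', 'red', 'red', 'black', 'blue', 'red', 'pink', 'blue'],
})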

In a similar case I was using this function, which works very fast: df = df.groupby('group').filter(lambda x: x.count() == 1)

However, this does not allow me to check the values of the columns (age, color). I've been playing around with the groupby functionality, but cannot seem to grasp it.

/e: I just realized that I missed an important condition in my question: I only want to drop the observations if one or several SPECIFIC columns have non-identical values within a group; the other columns may differ. In the example above, let's say I don't care whether color differs within a group, and only want to check whether age has a different value (I edited the example to reflect this). My actual case is more general and contains more columns, so I want to, e.g., check a few columns and ignore others when dropping observations.
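
One way to express "keep a group only if certain columns are constant within it" is a transform-based mask (a sketch of my own, not from the original post; cols_to_check is a name introduced for illustration):

# hypothetical sketch: keep a group only if every column in cols_to_check
# has exactly one unique value within that group
cols_to_check = ['age']
mask = df.groupby('group')[cols_to_check].transform('nunique').eq(1).all(axis=1)
df = df[mask]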

Upvotes: 0

Views: 404

Answers (2)

chrisb

Reputation: 52236

While @ismax's answer will work, you can use a pattern similar to your .count() solution, but drop duplicates first.

In [229]: df.groupby('group').filter(lambda x: len(x.drop_duplicates(subset=['age'])) == 1)
Out[229]: 
   group    name  age  color
3      2  Martin   78  black
4      2  Martin   78   blue
7      4    Jake   33   blue
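
If several columns have to agree within a group, the same pattern should extend by listing them all in subset (a hedged sketch; ['age', 'color'] is only an illustration, not part of the original answer):

df.groupby('group').filter(
    lambda x: len(x.drop_duplicates(subset=['age', 'color'])) == 1
)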

Upvotes: 2

Ismael EL ATIFI

Reputation: 2108

You can solve this using a dict of counters.

from collections import defaultdict, Counter

N = int(input())  # read the number of tuples
mapGroupAge = defaultdict(Counter)  # a dict of Counters to count
                                    # the repetitions per group

for _ in range(N):
    # read tuples (from standard input in this example)
    group, name, age, color = input().split()
    # build the map (dict) indexed by the groups, i.e. a key is the pair (group, name)
    mapGroupAge[(group, name)][(age, color)] += 1

for (group, name), counter in mapGroupAge.items():
    # if all ages and colors for the same group are the same
    if len(counter) == 1:
        age, color = next(iter(counter))
        # print all the repetitions
        for _ in range(counter[(age, color)]):
            print(group, name, age, color)

You can test the code above by executing it and pasting the following lines into standard input:

8
1       Anton   50     orange
1       Anton   21     red
1       Anton   21     red
2       Martin  78     blue
2       Martin  78     blue
3       Maria   25     red
3       Maria   25     pink
4       Jake    33     blue

As you wanted, the result of the execution is:

2 Martin 78 blue
2 Martin 78 blue
4 Jake 33 blue

Upvotes: 0
