cover51

Reputation: 107

Removing all rows of a duplicate based on value of multiple columns

I have a large dataframe with multiple columns and many rows (200k). I order the rows by a group variable, and each group can have one or more entries. The other columns within each group should have identical values, but in some cases they don't. It looks like this:

group   name    age    color
1       Anton   50     orange
1       Anton   21     red
1       Anton   21     red
2       Martin  78     black
2       Martin  78     blue
3       Maria   25     red
3       Maria   29     pink
4       Jake    33     blue

I want to delete all entries of a group if either age or color is not identical for all rows of the group (indicating an observation error). However, I want to keep duplicated entries if all columns have the same value. So the output I'm hoping for would be:

group   name    age    color   
2       Martin  78     black
2       Martin  78     blue  
4       Jake    33     blue
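
For reference, the example data can be reconstructed as a DataFrame like this (a minimal sketch; column dtypes are assumed):

import pandas as pd

df = pd.DataFrame({
    'group': [1, 1, 1, 2, 2, 3, 3, 4],
    'name':  ['Anton', 'Anton', 'Anton', 'Martin', 'Martin', 'Maria', 'Maria', 'Jake'],
    'age':   [50, 21, 21, 78, 78, 25, 29, 33],
    'color': ['orange', 'red', 'red', 'black', 'blue', 'red', 'pink', 'blue'],
})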

In a similar case I was using this function, which works very fast: df = df.groupby('group').filter(lambda x: x.count() == 1)

However, this does not allow me to check the values of the columns (age, color). I've been playing around with the groupby functionality, but cannot seem to grasp it.

/e: I just realized that I missed an important condition in my question: I only want to drop the observations if one or several SPECIFIC columns have non-identical values within a group; the other columns may differ. In the example above, let's say I don't care whether color differs within a group, and only want to check whether age has a different value (I edited the example to reflect this). My actual case is more general and contains more columns, so I want to, e.g., check a few columns and ignore others when dropping observations.
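
One way to express "keep a group only if certain columns are constant within it" is a transform-based mask (a sketch of my own, not from the original post; cols_to_check is a name introduced for illustration):

# hypothetical sketch: keep a group only if every column in cols_to_check
# has exactly one unique value within that group
cols_to_check = ['age']
mask = df.groupby('group')[cols_to_check].transform('nunique').eq(1).all(axis=1)
df = df[mask]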

Upvotes: 0

Views: 404

Answers (2)

chrisb

Reputation: 52236

While @ismax's answer will work, you can use a pattern similar to your .count() solution, but drop duplicates first.

In [229]: df.groupby('group').filter(lambda x: len(x.drop_duplicates(subset=['age'])) == 1)
Out[229]: 
   group    name  age  color
3      2  Martin   78  black
4      2  Martin   78   blue
7      4    Jake   33   blue
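
If several columns have to agree within a group, the same pattern should extend by listing them all in subset (a hedged sketch; ['age', 'color'] is only an illustration, not part of the original answer):

df.groupby('group').filter(
    lambda x: len(x.drop_duplicates(subset=['age', 'color'])) == 1
)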

Upvotes: 2

Ismael EL ATIFI

Reputation: 2108

You can solve this using a dict of counters.

from collections import defaultdict, Counter

N = int(input())  # read the number of tuples
mapGroupAge = defaultdict(Counter)  # a dict of Counters to count
                                    # the repetitions per group

for _ in range(N):
    # read tuples (from standard input in this example)
    group, name, age, color = input().split()
    # build the map (dict) indexed by the groups, i.e. a key is the pair (group, name)
    mapGroupAge[(group, name)][(age, color)] += 1

for (group, name), counter in mapGroupAge.items():
    # if all ages and colors for the same group are the same
    if len(counter) == 1:
        age, color = next(iter(counter))
        # print all the repetitions
        for _ in range(counter[(age, color)]):
            print(group, name, age, color)

You can test the code above by executing it and pasting the following lines into standard input:

8
1       Anton   50     orange
1       Anton   21     red
1       Anton   21     red
2       Martin  78     blue
2       Martin  78     blue
3       Maria   25     red
3       Maria   25     pink
4       Jake    33     blue

As you wanted, the result of the execution is:

2 Martin 78 blue
2 Martin 78 blue
4 Jake 33 blue

Upvotes: 0
