Reputation: 107
I have a large dataframe with multiple columns and many rows (200k). I order the rows by a group variable, and each group can have one or more entries. The other columns for each group should have identical values, however in some cases they dont. It looks like this:
group name age color
1 Anton 50 orange
1 Anton 21 red
1 Anton 21 red
2 Martin 78 black
2 Martin 78 blue
3 Maria 25 red
3 Maria 29 pink
4 Jake 33 blue
I want to delte all entries of a group, if either age or color is not identical for all rows of the group.(indicating observation error) However i want to keep duplicated entries if all columns have the same value. So the output im hoping for would be:
group name age color
2 Martin 78 black
2 Martin 78 blue
4 Jake 33 blue
In a similar case I was using this function, which works very fast: df = df.groupby('group').filter(lambda x: x.count() == 1)
However this does not allow me to check for the value of the columns (age,color). I've been playing around with the groupby functionality, but cannot seem to grasp it.
/e: I just realized that I missed an important condition in my question: I only want to drop the observations, if one or several SPECIFIC columns have duplicate values. Other columns can be different however. In the example above, lets say i dont care if there is a difference between color within a group, but only want to check if the age has a different value.(I edited the example to reflect this).My actual case is more general and contains more columns, so i want e.g. to check a few columns and ignore others when dropping observations.
Upvotes: 0
Views: 404
Reputation: 52236
While @ismax's answer will work, you can use a similar pattern to your .count()
solution, but dropping duplicates first.
In [229]: In [179]: df.groupby('group').filter(lambda x: len(x.drop_duplicates(subset=['age'])) == 1)
Out[229]:
group name age color
3 2 Martin 78 black
4 2 Martin 78 blue
7 4 Jake 33 blue
Upvotes: 2
Reputation: 2108
You can solve this using a dict of counters.
from collections import defaultdict, Counter
N = int(input())#read number of tuples
mapGroupAge = defaultdict(Counter)#a dict of counters to count
#the repetitions by group
for _ in range(N):
# read tuples (from standard input in this example)
group,name,age,color = input().split()
#build the map (dict) indexed by the groups i.e. a key is a pair (group,name)
mapGroupAge[(group,name)][(age,color)] += 1
for (group,name), counter in mapGroupAge.items():
# if all ages and colors for the same group are the same
if(len(counter)==1):
age,color = list(counter.keys())[0]
# print all the repetitions
for _ in range(counter[(age,color)]):
print(group, name, age,color)
You can test the code above by executing it and pasting the following lines in the standard input :
8
1 Anton 50 orange
1 Anton 21 red
1 Anton 21 red
2 Martin 78 blue
2 Martin 78 blue
3 Maria 25 red
3 Maria 25 pink
4 Jake 33 blue
As you wanted, the result of the execution is :
2 Martin 78 blue
2 Martin 78 blue
4 Jake 33 blue
Upvotes: 0