Matthijs

Reputation: 909

Pandas: DataFrame filtering using groupby and a function

Using Python 3.3 and Pandas 0.10

I have a DataFrame built by concatenating multiple CSV files. First, I filter out all rows whose Name value contains a certain string. The result looks something like this (shortened for brevity; there are actually more columns):

Name    ID
'A'     1
'B'     2
'C'     3
'C'     3
'E'     4
'F'     4
...     ...

Now my issue is that I want to remove a special case of 'duplicate' values. I want to remove every row whose ID is shared by rows with differing Name values. In the example above I would like to keep the rows with ID 1, 2 and 3. For ID=4 the Name values differ ('E' and 'F'), so I want to remove those rows.

I tried to use the following line of code (based on the suggestion here: Python Pandas: remove entries based on the number of occurrences).

Code:

df[df.groupby('ID').apply(lambda g: len({x for x in g['Name']})) == 1]

However that gives me the error: ValueError: Item wrong length 51906 instead of 109565!

Edit:

Instead of using apply() I have also tried using transform(), but that gives me the error: AttributeError: 'int' object has no attribute 'ndim'. An explanation of why the error differs between the two functions would be very much appreciated!

Also, I want to keep all rows where ID = 3 in the above example.

Thanks in advance, Matthijs

Upvotes: 4

Views: 9237

Answers (2)

Andy Hayden

Reputation: 375535

You could first drop the duplicates:

In [11]: df = df.drop_duplicates()

In [12]: df
Out[12]:
  Name ID
0    A  1
1    B  2
2    C  3
4    E  4
5    F  4

Then group by ID and keep only the groups with one element:

In [13]: g = df.groupby('ID')

In [14]: size = (g.size() == 1)

In [15]: size
Out[15]:
ID
1      True
2      True
3      True
4     False
dtype: bool

In [16]: size[size].index
Out[16]: Int64Index([1, 2, 3], dtype=int64)

In [17]: df['ID'].isin(size[size].index)
Out[17]:
0     True
1     True
2     True
4    False
5    False
Name: ID, dtype: bool

And boolean index by this:

In [18]: df[df['ID'].isin(size[size].index)]
Out[18]:
  Name ID
0    A  1
1    B  2
2    C  3

Upvotes: 0

Dan Allan

Reputation: 35245

Instead of the length (len) of each group, I think you want to consider the number of unique values of Name in each group. Use nunique(), and check out this neat recipe for filtering groups.

df[df.groupby('ID').Name.transform(lambda x: x.nunique() == 1).astype('bool')]
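As a rough illustration of why transform fits here, the sketch below rebuilds the question's sample data and builds the mask in a separate step. It assumes a recent pandas in which transform accepts the string "nunique". The point is that transform returns one value per row, so the mask lines up with df; the apply version returns one value per group, which is presumably why the question's error reports a shorter length than the number of rows.

```python
import pandas as pd

# Sample data rebuilt from the example in the question
df = pd.DataFrame({
    "Name": ["A", "B", "C", "C", "E", "F"],
    "ID": [1, 2, 3, 3, 4, 4],
})

# One value per row: the number of distinct names within that row's ID group
names_per_id = df.groupby("ID")["Name"].transform("nunique")

# Boolean mask with the same length and index as df
mask = names_per_id == 1

filtered = df[mask]
print(filtered)
```

Both rows with ID 3 survive, since their names agree, while both rows with ID 4 are dropped.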

If you upgrade to pandas 0.12, you can use the new filter method on groups, which makes this more succinct and straightforward.

df.groupby('ID').filter(lambda x: x.Name.nunique() == 1)
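A self-contained version of the filter call, again on data rebuilt from the question's example, might look like the following. The function name and its key/value parameters are illustrative only and not part of pandas.

```python
import pandas as pd

# Sample data rebuilt from the example in the question
df = pd.DataFrame({
    "Name": ["A", "B", "C", "C", "E", "F"],
    "ID": [1, 2, 3, 3, 4, 4],
})


def drop_conflicting_groups(frame, key="ID", value="Name"):
    """Hypothetical helper: drop every group whose `value` column disagrees."""
    # filter keeps all rows of a group when the function returns True for it
    return frame.groupby(key).filter(lambda g: g[value].nunique() == 1)


kept = drop_conflicting_groups(df)
print(kept)
```

The rows keep their original index labels, so the result can still be matched against the unfiltered frame.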

A general remark: Sometimes, of course, you do want to know the length of the group, but I find that size is a safer choice than len, which has been troublesome for me in some cases.
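To make the difference between group length and unique count concrete, here is a small comparison on the question's sample data (rebuilt here for illustration). size counts the rows per ID, while nunique counts the distinct names per ID.

```python
import pandas as pd

# Sample data rebuilt from the example in the question
df = pd.DataFrame({
    "Name": ["A", "B", "C", "C", "E", "F"],
    "ID": [1, 2, 3, 3, 4, 4],
})

grouped = df.groupby("ID")

# Number of rows in each group
rows_per_id = grouped.size()

# Number of distinct Name values in each group
names_per_id = grouped["Name"].nunique()

print(pd.DataFrame({"rows": rows_per_id, "unique_names": names_per_id}))
```

ID 3 has two rows but only one distinct name, so a test on the row count alone would wrongly discard it.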

Upvotes: 5
