Thanos

Reputation: 2572

How to apply a pre-condition to GroupBy or how to ignore groups with one record in GroupBy

I have a set of rows that I want to group by the value of an identifier (present in each row) and then do further, isolated processing on each of the resulting groups.

My dataframe looks like this:

In [50]: df
Out[50]: 
  groupkey    b    c   d   e                date
0       C1   b1   c1  d1  e1 2014-10-26 12:13:14
1       C2  NaN   c2  d2  e2 2014-11-02 12:13:14
2       C1   b3   c3  d3  e3 2014-11-09 12:13:14
3       C1   b4  NaN  d4  e4 2014-11-16 12:13:14
4       C3   b5   c5  d5  e5 2014-11-23 12:13:14
5       C2   b6   c6  d6  e6 2014-11-30 12:13:14
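
For reference, the sample frame can be rebuilt from the output above; a minimal construction sketch:

import pandas as pd
import numpy as np

# values transcribed from the displayed frame
df = pd.DataFrame({
    'groupkey': ['C1', 'C2', 'C1', 'C1', 'C3', 'C2'],
    'b': ['b1', np.nan, 'b3', 'b4', 'b5', 'b6'],
    'c': ['c1', 'c2', 'c3', np.nan, 'c5', 'c6'],
    'd': ['d1', 'd2', 'd3', 'd4', 'd5', 'd6'],
    'e': ['e1', 'e2', 'e3', 'e4', 'e5', 'e6'],
    'date': pd.to_datetime(['2014-10-26 12:13:14', '2014-11-02 12:13:14',
                            '2014-11-09 12:13:14', '2014-11-16 12:13:14',
                            '2014-11-23 12:13:14', '2014-11-30 12:13:14'])
})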

and if I were to group by groupkey, I know I would just work on the GroupBy object returned by:

>>> df.groupby('groupkey')

However, before grouping, and also to reduce the size of my dataset, I want to exclude any rows that would end up in a group with only one record (if grouped in the manner described above).

In my example that would mean that row 4 should be left out.

Now, it seems to me that the easiest way to count the records per group would be to group first and then count the records, like this:

>>> df.groupby('groupkey').count()

I suppose I could do this and then drop the groups that only have one record.
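
For illustration, that two-step approach might look like this (a sketch with illustrative names; note that size() counts rows per group, whereas count() counts non-NaN values per column):

sizes = df.groupby('groupkey').size()     # rows per group
keep = sizes[sizes > 1].index             # group keys with more than one record
df_multi = df[df['groupkey'].isin(keep)]  # rows whose group has 2+ records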

  1. I am not sure how to do this without having to manually go back and drop the groups that have only one record.

  2. I was wondering if there's a way to group by some function that will allow me to take this condition into consideration while grouping?

Thanks for the help

Upvotes: 2

Views: 263

Answers (2)

jezrael

Reputation: 863226

I think you can first filter the DataFrame by value_counts with map and boolean indexing:

print df.groupkey.value_counts() != 1
C1     True
C2     True
C3    False
Name: groupkey, dtype: bool

print df.groupkey.map(df.groupkey.value_counts() != 1)
0     True
1     True
2     True
3     True
4    False
5     True
Name: groupkey, dtype: bool

print df[df.groupkey.map(df.groupkey.value_counts() != 1)]
  groupkey    b    c   d   e                 date
0       C1   b1   c1  d1  e1  2014-10-26 12:13:14
1       C2  NaN   c2  d2  e2  2014-11-02 12:13:14
2       C1   b3   c3  d3  e3  2014-11-09 12:13:14
3       C1   b4  NaN  d4  e4  2014-11-16 12:13:14
5       C2   b6   c6  d6  e6  2014-11-30 12:13:14

Interestingly, it is faster than the filter solution (len(df) = 6k):

df = pd.concat([df]*1000).reset_index(drop=True)

In [21]: %timeit df[df.groupkey.map(df.groupkey.value_counts() != 1)]
1000 loops, best of 3: 1.87 ms per loop

In [22]: %timeit df.groupby('groupkey').filter(lambda x: len(x) != 1)
100 loops, best of 3: 2.71 ms per loop
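
Another map-free spelling of the same idea (a sketch, not benchmarked above) is transform, which also returns a row-aligned Series of group sizes:

# transform('size') broadcasts each group's row count back to its rows
print df[df.groupby('groupkey')['groupkey'].transform('size') > 1]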

Upvotes: 2

EdChum

Reputation: 394179

You want to filter the groupby object using len on the groups:

In [9]:
df.groupby('groupkey').filter(lambda x: len(x) > 1)

Out[9]:
  groupkey    b    c   d   e                date
0       C1   b1   c1  d1  e1 2014-10-26 12:13:14
1       C2  NaN   c2  d2  e2 2014-11-02 12:13:14
2       C1   b3   c3  d3  e3 2014-11-09 12:13:14
3       C1   b4  NaN  d4  e4 2014-11-16 12:13:14
5       C2   b6   c6  d6  e6 2014-11-30 12:13:14
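
For context, the group sizes that drive the filter can be inspected directly; C3 has a single row, so it is the group that gets dropped:

In [10]:
df.groupby('groupkey').size()

Out[10]:
groupkey
C1    3
C2    2
C3    1
dtype: int64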

Upvotes: 2
