Reputation: 2572
I have a set of rows that I want to group by the value of an identifier present in each row, and then do further isolated processing on the resulting groups.
My dataframe looks like this:
In [50]: df
Out[50]:
groupkey b c d e date
0 C1 b1 c1 d1 e1 2014-10-26 12:13:14
1 C2 NaN c2 d2 e2 2014-11-02 12:13:14
2 C1 b3 c3 d3 e3 2014-11-09 12:13:14
3 C1 b4 NaN d4 e4 2014-11-16 12:13:14
4 C3 b5 c5 d5 e5 2014-11-23 12:13:14
5 C2 b6 c6 d6 e6 2014-11-30 12:13:14
If I were to group by groupkey, I know I should just work on the GroupBy object returned by:
>> df.groupby('groupkey')
However, before grouping, and with the parallel goal of reducing the size of my dataset, I want to exclude any rows that would end up alone in their group (if grouped in the manner described above).
In my example that would mean that row 4 (the only C3 row) should be left out.
Now, it seems to me that the easiest way to count the records per group would of course entail grouping first and then counting the records, like this:
>> df.groupby('groupkey').count()
I suppose I could do this and then drop the groups that only have one record, but I am not sure how to do that without manually going back over the grouped result.
I was wondering if there's a way to group by some function that would take this condition into consideration while grouping? For concreteness, the two-step route I'd like to avoid is sketched below.
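(A sketch of the manual two-step route; size() is used rather than count() here, since size() counts rows while count() counts non-NaN values per column.)
sizes = df.groupby('groupkey').size()       # C1: 3, C2: 2, C3: 1
keep = sizes[sizes > 1].index               # group keys to keep
df_reduced = df[df['groupkey'].isin(keep)]  # row 4 (C3) is dropped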
Thanks for the help
Upvotes: 2
Views: 263
Reputation: 863226
I think you can first filter the DataFrame by value_counts with map and boolean indexing:
print(df.groupkey.value_counts() != 1)
C1 True
C2 True
C3 False
Name: groupkey, dtype: bool
print(df.groupkey.map(df.groupkey.value_counts() != 1))
0 True
1 True
2 True
3 True
4 False
5 True
Name: groupkey, dtype: bool
print(df[df.groupkey.map(df.groupkey.value_counts() != 1)])
groupkey b c d e date
0 C1 b1 c1 d1 e1 2014-10-26 12:13:14
1 C2 NaN c2 d2 e2 2014-11-02 12:13:14
2 C1 b3 c3 d3 e3 2014-11-09 12:13:14
3 C1 b4 NaN d4 e4 2014-11-16 12:13:14
5 C2 b6 c6 d6 e6 2014-11-30 12:13:14
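For reference, a self-contained version of the above (a sketch: the sample frame is rebuilt inline, the dates are generated with a weekly date_range, and print is written as a function for Python 3):
import pandas as pd
import numpy as np

# rebuild the sample frame from the question
df = pd.DataFrame({
    'groupkey': ['C1', 'C2', 'C1', 'C1', 'C3', 'C2'],
    'b': ['b1', np.nan, 'b3', 'b4', 'b5', 'b6'],
    'c': ['c1', 'c2', 'c3', np.nan, 'c5', 'c6'],
    'd': ['d1', 'd2', 'd3', 'd4', 'd5', 'd6'],
    'e': ['e1', 'e2', 'e3', 'e4', 'e5', 'e6'],
    'date': pd.date_range('2014-10-26 12:13:14', periods=6, freq='7D'),
})

# value_counts gives each group's size; map broadcasts that size back
# onto every row, and the boolean mask keeps only multi-row groups
mask = df.groupkey.map(df.groupkey.value_counts()) != 1
print(df[mask])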
Interestingly, it is faster than the filter solution (len(df) = 6k):
df = pd.concat([df]*1000).reset_index(drop=True)
In [21]: %timeit df[df.groupkey.map(df.groupkey.value_counts() != 1)]
1000 loops, best of 3: 1.87 ms per loop
In [22]: %timeit df.groupby('groupkey').filter(lambda x: len(x) != 1)
100 loops, best of 3: 2.71 ms per loop
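For completeness, a third vectorized option (my addition, not benchmarked above) broadcasts the group sizes with transform:
# transform('size') returns a row-aligned Series of group sizes,
# so no intermediate value_counts lookup is needed
print(df[df.groupby('groupkey')['groupkey'].transform('size') > 1])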
Upvotes: 2
Reputation: 394179
You want to filter the groupby object using len on the groups:
In [9]:
df.groupby('groupkey').filter(lambda x: len(x) > 1)
Out[9]:
groupkey b c d e date
0 C1 b1 c1 d1 e1 2014-10-26 12:13:14
1 C2 NaN c2 d2 e2 2014-11-02 12:13:14
2 C1 b3 c3 d3 e3 2014-11-09 12:13:14
3 C1 b4 NaN d4 e4 2014-11-16 12:13:14
5 C2 b6 c6 d6 e6 2014-11-30 12:13:14
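The same thing with a named predicate, to make the mechanics explicit: filter calls the predicate once per group, passing that group's sub-DataFrame, and keeps the whole group when the predicate returns True.
def has_multiple_rows(group):
    # group is the sub-DataFrame for one groupkey value
    return len(group) > 1

print(df.groupby('groupkey').filter(has_multiple_rows))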
Upvotes: 2