smd1000
smd1000

Reputation: 313

pandas: groupby two columns nunique

I have the following sample set.

        CustID     Condition      Month        Reading  Consumption 
0     108000601         True       June       20110606      28320.0
1     108007000         True       July       20110705      13760.0
2     108007000         True     August       20110804      16240.0
3     108008000         True  September       20110901      12560.0
4     108008000         True    October       20111004      12400.0
5     108000601        False   November       20111101       9440.0
6     108090000        False   December       20111205      12160.0
7     108008000        False    January       20120106      11360.0
8     108000601         True   February       20120206      10480.0
9     108000601         True      March       20120306       9840.0

The following groupby provides me part of what I'm looking for.

dfm.groupby(['Condition'])['CustID'].nunique()

Condition
True      3
False     3

But how do I get unique ID's that match both conditions? e.g.

Condition
True      3
False     3
Both      2

Upvotes: 1

Views: 1159

Answers (2)

jme
jme

Reputation: 20725

I'd suggest grouping on CustID. Then we can look through each group and easily determine whether each unique id has only True, only False, or both. Then we simply use Series.value_counts():

def categorize(s):
    if s.all():
        return 'True'
    elif not s.any():
        return 'False'
    else:
        return 'Both'

categorized = df.groupby('CustID')['Condition'].apply(categorize)
categorized.value_counts()

which gives

Both     2
False    1
True     1
Name: Condition, dtype: int64

Upvotes: 1

maxymoo
maxymoo

Reputation: 36545

Not sure if this is the most "pandas" way but you can use set to compare the users in each partition (the Python set data-structure is a hash table which will automatically discard duplicates):

custid_true = set(dfm[dfm['Condition']==True].CustID)
custid_false = set(dfm[dfm['Condition']==False].CustID)
custid_both = custid_true.intersection(custid_false)
n_custid_both = len(custid_both)

Upvotes: 2

Related Questions