Reputation: 313
I have the following sample set.
CustID Condition Month Reading Consumption
0 108000601 True June 20110606 28320.0
1 108007000 True July 20110705 13760.0
2 108007000 True August 20110804 16240.0
3 108008000 True September 20110901 12560.0
4 108008000 True October 20111004 12400.0
5 108000601 False November 20111101 9440.0
6 108090000 False December 20111205 12160.0
7 108008000 False January 20120106 11360.0
8 108000601 True February 20120206 10480.0
9 108000601 True March 20120306 9840.0
The following groupby provides me part of what I'm looking for.
dfm.groupby(['Condition'])['CustID'].nunique()
Condition
True 3
False 3
But how do I get unique ID's that match both conditions? e.g.
Condition
True 3
False 3
Both 2
Upvotes: 1
Views: 1159
Reputation: 20725
I'd suggest grouping on CustID
. Then we can look through each group and easily determine whether each unique id has only True
, only False
, or both. Then we simply use Series.value_counts()
:
def categorize(s):
if s.all():
return 'True'
elif not s.any():
return 'False'
else:
return 'Both'
categorized = df.groupby('CustID')['Condition'].apply(categorize)
categorized.value_counts()
which gives
Both 2
False 1
True 1
Name: Condition, dtype: int64
Upvotes: 1
Reputation: 36545
Not sure if this is the most "pandas" way but you can use set
to compare the users in each partition (the Python set
data-structure is a hash table which will automatically discard duplicates):
custid_true = set(dfm[dfm['Condition']==True].CustID)
custid_false = set(dfm[dfm['Condition']==False].CustID)
custid_both = custid_true.intersection(custid_false)
n_custid_both = len(custid_both)
Upvotes: 2