Reputation: 679
I'm learning Python/Pandas and I'm working with a DataFrame that has the following structure:
import pandas as pd

df1 = pd.DataFrame({'unique_id': [1, 1, 2, 2, 2, 3, 3, 3, 3, 3],
                    'brand': ['A', 'B', 'A', 'C', 'X', 'A', 'C', 'X', 'X', 'X']})
print(df1)
   unique_id brand
0          1     A
1          1     B
2          2     A
3          2     C
4          2     X
5          3     A
6          3     C
7          3     X
8          3     X
9          3     X
My goal is to make some calculations on the above DataFrame. Specifically, for each unique_id, I want to:
- count the number of unique brands, without taking brand X into account;
- count the number of times brand X appears.
Visually, using the above example, the resulting DataFrame I'm looking for should look like this:
   unique_id  count_brands_not_x  count_brand_x
0          1                   2              0
1          2                   2              1
2          3                   2              3
I have used the groupby method on simple examples in the past, but I don't know how to specify conditions in a groupby to solve this new problem I have. Any help would be appreciated.
Upvotes: 0
Views: 90
Reputation: 143110
I would first create the groups and later count the elements in each group. But maybe there is a better function to count items in agg():
import pandas as pd

df1 = pd.DataFrame({'unique_id': [1, 1, 2, 2, 2, 3, 3, 3, 3, 3],
                    'brand': ['A', 'B', 'A', 'C', 'X', 'A', 'C', 'X', 'X', 'X']})

# group once, then count the X and non-X rows in every group
g = df1.groupby('unique_id')

df = pd.DataFrame()
df['count_brand_x'] = g['brand'].agg(lambda data: sum(data == 'X'))
df['count_brands_not_x'] = g['brand'].agg(lambda data: sum(data != 'X'))
df = df.reset_index()

print(df)
EDIT: if I already have df['count_brand_x'], then the other column can be computed from the group size:
df['count_brands_not_x'] = g['brand'].count() - df['count_brand_x']
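A minimal sketch of the same idea using named aggregation, so that both columns come out of a single agg() call (assuming pandas 0.25+, where agg accepts keyword aggregations):

import pandas as pd

df1 = pd.DataFrame({'unique_id': [1, 1, 2, 2, 2, 3, 3, 3, 3, 3],
                    'brand': ['A', 'B', 'A', 'C', 'X', 'A', 'C', 'X', 'X', 'X']})

# one agg() call producing both counts per unique_id
df = (df1.groupby('unique_id')['brand']
         .agg(count_brands_not_x=lambda s: (s != 'X').sum(),
              count_brand_x=lambda s: (s == 'X').sum())
         .reset_index())
print(df)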
Upvotes: 1
Reputation: 25269
I would use pd.crosstab on the True/False mask obtained by comparing brand against the value X:
s = df1.brand.eq('X')
df_final = (pd.crosstab(df1.unique_id, s)
              .rename({False: 'count_brands_not_x', True: 'count_brand_x'}, axis=1))
Out[134]:
brand      count_brands_not_x  count_brand_x
unique_id
1                           2              0
2                           2              1
3                           2              3
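If the flat layout asked for in the question is wanted, the crosstab result can be tidied up afterwards; a small follow-up sketch, assuming the df_final from above:

# drop the 'brand' column-axis name and turn unique_id back into a regular column
df_final = df_final.rename_axis(None, axis=1).reset_index()
print(df_final)

This gives exactly the three columns unique_id, count_brands_not_x and count_brand_x shown in the question.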
Upvotes: 3
Reputation: 42946
You can use GroupBy and merge:
maskx = df1['brand'].eq('X')
d1 = df1[~maskx].groupby('unique_id')['brand'].size().reset_index()
d2 = df1[maskx].groupby('unique_id')['brand'].size().reset_index()
df = d1.merge(d2, on='unique_id', how='outer', suffixes=['_not_x', '_x']).fillna(0)
   unique_id  brand_not_x  brand_x
0          1            2     0.00
1          2            2     1.00
2          3            2     3.00
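Because of the outer merge and fillna(0), the brand_x column comes back as float; if integer counts are preferred, a possible follow-up on the df built above:

# cast the float count produced by the outer merge + fillna(0) back to int
df['brand_x'] = df['brand_x'].astype(int)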
Upvotes: 4
Reputation: 59579
You can subset the original DataFrame and use the appropriate groupby operations for each calculation. concat joins the results:
import pandas as pd

s = df1.brand.eq('X')
res = (pd.concat([df1[~s].groupby('unique_id').brand.nunique().rename('unique_not_X'),
                  df1[s].groupby('unique_id').size().rename('count_X')],
                 axis=1)
         .fillna(0))

#           unique_not_X  count_X
#unique_id
#1                     2      0.0
#2                     2      1.0
#3                     2      3.0
If instead of "unique_brands" you just want the number of rows with brands that are not "X" then we can perform a single groupby
and unstack
the result.
(df1.groupby(['unique_id', df1.brand.eq('X').map({True: 'count_X', False: 'count_not_X'})])
    .size().unstack(-1).fillna(0))

#brand       count_X  count_not_X
#unique_id
#1               0.0          2.0
#2               1.0          2.0
#3               3.0          2.0
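To line this up with the exact column names from the question, the unstacked counts can be renamed and cast afterwards; a small sketch, assuming the result of the expression above has been assigned to a variable called out (a name used here only for illustration):

out = (out.rename(columns={'count_not_X': 'count_brands_not_x',
                           'count_X': 'count_brand_x'})
          .rename_axis(None, axis=1)
          .reset_index()
          .astype({'count_brands_not_x': int, 'count_brand_x': int}))
print(out[['unique_id', 'count_brands_not_x', 'count_brand_x']])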
Upvotes: 1