Reputation: 73
I'm trying to count the number of each category of storm for each unique x and y combination. For example, my dataframe looks like:
x y year Category
1 1 1988 3
2 1 1977 1
2 1 1999 2
3 2 1990 4
I want to create a dataframe that looks like:
x  y  Category 1  Category 2  Category 3  Category 4
1  1           0           0           1           0
2  1           1           1           0           0
3  2           0           0           0           1
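(For reference, the example input above can be reproduced with a small constructor; the column names follow the table shown.)
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 2, 3],
                   'y': [1, 1, 1, 2],
                   'year': [1988, 1977, 1999, 1990],
                   'Category': [3, 1, 2, 4]})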
I have tried various combinations of .groupby() and .count(), but I am still not getting the desired result. The closest thing I could get is:
df[['x','y','Category']].groupby(['Category']).count()
However, the result counts for all x and y, not for the unique pairs:
Cat x y
1 3773 3773
2 1230 1230
3 604 604
4 266 266
5 50 50
NA 27620 27620
TS 16884 16884
Does anyone know how to do a count operation on one column based on the uniqueness of two other columns in a dataframe?
Upvotes: 7
Views: 11269
Reputation: 153560
You can use pd.get_dummies after setting the index with set_index, then use sum with the level parameter to collapse the rows:
pd.get_dummies(df.set_index(['x', 'y'])['Category'].astype(str),
               prefix='Category ', prefix_sep='')\
  .sum(level=[0, 1])\
  .reset_index()
Output:
   x  y  Category 1  Category 2  Category 3  Category 4
0  1  1           0           0           1           0
1  2  1           1           1           0           0
2  3  2           0           0           0           1
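Note that newer pandas releases no longer accept the level argument to sum (it was deprecated and later removed); if that applies to you, the same collapse can be written with an explicit groupby on the index levels. A sketch of the equivalent:
import pandas as pd

out = (pd.get_dummies(df.set_index(['x', 'y'])['Category'].astype(str),
                      prefix='Category ', prefix_sep='')
         .groupby(level=['x', 'y']).sum()   # replaces the removed sum(level=[0, 1])
         .reset_index())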
Upvotes: 1
Reputation: 10590
pivot_table sounds like what you want. A bit of a hack is to add a column of 1's to use for counting. This lets pivot_table add 1 for each occurrence of a particular x-y and Category combination. Set this new column as the values parameter in pivot_table and the aggfunc parameter to np.sum. You'll probably want to set fill_value to 0 as well:
import numpy as np

df['count'] = 1
result = df.pivot_table(
    index=['x', 'y'], columns='Category', values='count',
    fill_value=0, aggfunc=np.sum
)
result:
Category  1  2  3  4
x y
1 1       0  0  1  0
2 1       1  1  0  0
3 2       0  0  0  1
If you're interested in keeping x and y as columns and having the other column names as Category X, you can rename the columns and use reset_index:
result.columns = [f'Category {x}' for x in result.columns]
result = result.reset_index()
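(As an aside, the helper column can be avoided entirely by letting pivot_table count rows itself with aggfunc='size'; a sketch of that variant, not the answer's original code:)
result = df.pivot_table(index=['x', 'y'], columns='Category',
                        aggfunc='size', fill_value=0)   # count rows per cell directly
result.columns = [f'Category {c}' for c in result.columns]
result = result.reset_index()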
Upvotes: 2
Reputation: 4792
You can use groupby first:
df['count'] = 1   # helper column of ones (as in the previous answer)
df_new = df.groupby(['x', 'y', 'Category']).count()
df_new
              year  count
x y Category
1 1 3            1      1
2 1 1            1      1
    2            1      1
3 2 4            1      1
Then use pivot_table:
df_new = df_new.pivot_table(index=['x', 'y'], columns='Category', values='count', fill_value=0)
df_new
Category  1  2  3  4
x y
1 1       0  0  1  0
2 1       1  1  0  0
3 2       0  0  0  1
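(The same two-step idea can also be collapsed into a single chain with size and unstack; a sketch, not part of the original answer:)
counts = (df.groupby(['x', 'y', 'Category'])
            .size()                    # number of rows per (x, y, Category)
            .unstack(fill_value=0)     # spread Category values into columns
            .add_prefix('Category ')
            .reset_index())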
Upvotes: 1
Reputation: 71620
Or use groupby twice, with a lot of additional steps, i.e. get_dummies with apply, etc. Like:
>>> df.join(df.groupby(['x', 'y'])['Category']
            .apply(lambda x: x.astype(str).str.get_dummies().add_prefix('Category ')))\
       .groupby(['x', 'y']).sum().fillna(0)\
       .drop(['year', 'Category'], axis=1).reset_index()
   x  y  Category 1  Category 2  Category 3  Category 4
0  1  1         0.0         0.0         1.0         0.0
1  2  1         1.0         1.0         0.0         0.0
2  3  2         0.0         0.0         0.0         1.0
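(A shorter variant of the same get_dummies idea, grouping the dummies by the x and y columns directly; a sketch rather than the answer's original code:)
import pandas as pd

out = (pd.get_dummies(df['Category'], prefix='Category', prefix_sep=' ')
         .groupby([df['x'], df['y']]).sum()   # collapse dummies per (x, y) pair
         .reset_index())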
Upvotes: 0