Reputation: 7526
I have a pandas dataframe test
whose values I would like to turn into percentiles of all integers in categories
, for example:
import pandas as pd
categories = [0,1,2,3,4,5,6,7,8,9,10]
test
id value
foo 0
foo 0
foo 1
foo 1
foo 5
foo 4
foo 4
foo 4
foo 3
foo 3
bar 2
bar 2
bar 2
bar 2
bar 2
bar 6
bar 6
bar 6
bar 6
bar 6
The problem I am having is mapping 0 percentiles to all possible integers in categories. I when I try
test.groupby('id')['value'].apply(lambda x: x.value_counts(normalize=True)).unstack().fillna(0)
The following dataframe is returned, but it is missing values 7, 8, 9, 10, etc. because they are not contained in each id:
0 1 2 3 4 5 6
id
bar 0.0 0.0 0.5 0.0 0.0 0.0 0.5
foo 0.2 0.2 0.0 0.2 0.3 0.1 0.0
Is there a efficient way of adding all values of catgories
into the value_count aggregation function so that the following result is returned?
0 1 2 3 4 5 6 7 8 9 10
foo 0.2 0.2 0.0 0.2 0.3 0.1 0.5 0 0 0 0
bar 0.0 0.0 0.5 0.0 0.0 0.0 0.0 0 0 0 0
Upvotes: 4
Views: 117
Reputation: 294508
Categorical
df['value'] = pd.Categorical(df.value, categories)
df.groupby(['id', 'value']).size().unstack()
value 0 1 2 3 4 5 6 7 8 9 10
id
bar 0 0 5 0 0 0 5 0 0 0 0
foo 2 2 0 2 3 1 0 0 0 0 0
i, r = pd.factorize(df['id'].to_numpy())
j, c = pd.factorize(df['value'].to_numpy())
n, m = r.size, c.size
b = np.zeros((n, 11), np.int64)
np.add.at(b, (i, j), 1)
pd.DataFrame(b, r, range(11))
0 1 2 3 4 5 6 7 8 9 10
foo 2 2 1 3 2 0 0 0 0 0 0
bar 0 0 0 0 0 5 5 0 0 0 0
Upvotes: 3
Reputation: 150785
Try with, not necessarily more efficient but more readable:
(pd.crosstab(df['id'], df['value'])
.reindex(categories, axis=1, fill_value=0)
)
Upvotes: 3