iskandarblue
iskandarblue

Reputation: 7526

Group by aggregation for missing values in range

I have a pandas dataframe test whose values I would like to turn into percentiles of all integers in categories, for example:

import pandas as pd

categories = [0,1,2,3,4,5,6,7,8,9,10]


test  
id  value
foo  0
foo  0
foo  1
foo  1
foo  5
foo  4
foo  4
foo  4
foo  3
foo  3
bar  2
bar  2
bar  2
bar  2
bar  2
bar  6
bar  6
bar  6
bar  6
bar  6

The problem I am having is mapping 0 percentiles to all possible integers in categories. I when I try

test.groupby('id')['value'].apply(lambda x: x.value_counts(normalize=True)).unstack().fillna(0)

The following dataframe is returned, but it is missing values 7, 8, 9, 10, etc. because they are not contained in each id:

    0   1   2   3   4   5   6
id                          
bar 0.0 0.0 0.5 0.0 0.0 0.0 0.5
foo 0.2 0.2 0.0 0.2 0.3 0.1 0.0

Is there a efficient way of adding all values of catgories into the value_count aggregation function so that the following result is returned?

    0   1   2   3   4   5   6   7   8   9  10
foo 0.2 0.2 0.0 0.2 0.3 0.1 0.5 0   0   0   0
bar 0.0 0.0 0.5 0.0 0.0 0.0 0.0 0   0   0   0

Upvotes: 4

Views: 117

Answers (2)

piRSquared
piRSquared

Reputation: 294508

Categorical

df['value'] = pd.Categorical(df.value, categories)

df.groupby(['id', 'value']).size().unstack()

value  0  1  2  3  4  5  6  7  8  9  10
id                                     
bar    0  0  5  0  0  0  5  0  0  0   0
foo    2  2  0  2  3  1  0  0  0  0   0

Over engineered

i, r = pd.factorize(df['id'].to_numpy())
j, c = pd.factorize(df['value'].to_numpy())
n, m = r.size, c.size

b = np.zeros((n, 11), np.int64)

np.add.at(b, (i, j), 1)

pd.DataFrame(b, r, range(11))

     0   1   2   3   4   5   6   7   8   9   10
foo   2   2   1   3   2   0   0   0   0   0   0
bar   0   0   0   0   0   5   5   0   0   0   0

Upvotes: 3

Quang Hoang
Quang Hoang

Reputation: 150785

Try with, not necessarily more efficient but more readable:

(pd.crosstab(df['id'], df['value'])
   .reindex(categories, axis=1, fill_value=0)
)

Upvotes: 3

Related Questions