rhz
rhz

Reputation: 1132

histogram/value counts from pandas dataframe columns with categorical data and custom "bins"

Consider the following dataframe:

import pandas as pd

x = pd.DataFrame([[ 'a', 'b'], ['a', 'c'], ['c', 'b'], ['d', 'c']])
print(x)

   0  1
0  a  b
1  a  c
2  c  b
3  d  c

I would like to obtain the relative frequencies of the data in each column of the dataframe based on some custom "bins" which would be (a possible super-set of) the unique data values. For example, if:

b = ['a', 'b', 'c', 'd', 'e', 'f']

I would like to obtain:

   0  1
a  2  0
b  0  2
c  1  2
d  1  0
e  0  0
f  0  0

Is there a one (or two) liner to achieve this?

Upvotes: 0

Views: 414

Answers (1)

Henry Ecker
Henry Ecker

Reputation: 35626

Try apply value_counts, then reindex based on b:

import pandas as pd

x = pd.DataFrame([['a', 'b'], ['a', 'c'], ['c', 'b'], ['d', 'c']])

b = ['a', 'b', 'c', 'd', 'e', 'f']
df = x.apply(lambda s: s.value_counts()).reindex(b).fillna(0).astype(int)

print(df)

df:

   0  1
a  2  0
b  0  2
c  1  2
d  1  0
e  0  0
f  0  0

A melt and crosstab option:

import pandas as pd

x = pd.DataFrame([['a', 'b'], ['a', 'c'], ['c', 'b'], ['d', 'c']])

b = ['a', 'b', 'c', 'd', 'e', 'f']
df = x.melt()
df = pd.crosstab(df['value'], df['variable']) \
    .reindex(b).fillna(0).astype(int) \
    .rename_axis(None, axis=1).rename_axis(None, axis=0)

print(df)

df:

   0  1
a  2  0
b  0  2
c  1  2
d  1  0
e  0  0
f  0  0

Upvotes: 1

Related Questions