How to plot many unevenly distributed value_counts

Question

Let's say I have the following data:

s2 = pd.Series([1,2,3,4,5,2,3,333,2,123,434,1,2,3,1,11,11,432,3,2,4,3,3,3,54,34,24,2,223,2535334,3,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,30000, 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2])
s2.value_counts(normalize=True).plot()

I want to show that a few numbers account for the majority of cases. The problem is that this only shows up at the far left of the graph, followed by a long run of short bars for all the other categories.

In the real data, the x-axis will be categorical with about 18000 categories; about 4% of them will have counts around 10000, and the rest will drop off to around 50.

Update: see @unutbu's answer.

I updated the code, and now I'm getting an error from qcut when trying to use tuples:

TypeError: unsupported operand type(s) for -: 'tuple' and 'tuple'

import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({'s1':[1,0,1,0], 's2':[1,0,1,1], 's3':[1,0,1,1], 's4':[0,0,0,1]})
perms = df.apply(tuple, axis=1)
prob = perms.value_counts(normalize=True).reset_index(drop=True)
category_classes = pd.qcut(prob, q=[0, .25, 0.95, 1.],
                 labels=['bottom 25%', 'mid 70%', 'top 5%'])
prob_groups = prob.groupby(category_classes).sum()
prob_groups.plot(kind='bar')
plt.xticks(rotation=0)
plt.show()

unutbu · Accepted Answer

You could keep the normalized value counts above a certain threshold, then sum the values below the threshold and lump them into a single category called, say, "other".

By choosing a high enough threshold, you will be able to display the most important contributors to the overall probability distribution, while still showing the size of the tail in the bar labeled "other":

import matplotlib.pyplot as plt
import pandas as pd

s2 = pd.Series([1,2,3,4,5,2,3,333,2,123,434,1,2,3,1,11,11,432,3,2,4,3,3,3,54,34,24,2,223,2535334,3,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,30000, 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2])
prob = s2.value_counts(normalize=True)
threshold = 0.02
mask = prob > threshold
# Sum everything at or below the threshold into a single "other" entry
tail_prob = prob.loc[~mask].sum()
prob = prob.loc[mask]
prob['other'] = tail_prob
prob.plot(kind='bar', rot=25)
plt.show()
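If you need this in more than one place, the thresholding steps above can be wrapped in a small helper. This is only a sketch: the function name lump_tail and its label parameter are made up for illustration and are not part of pandas.

```python
import pandas as pd

def lump_tail(values, threshold=0.02, label="other"):
    """Normalized value counts, with every share <= threshold summed into `label`.

    Hypothetical helper for illustration; not a pandas function.
    """
    prob = pd.Series(values).value_counts(normalize=True)
    mask = prob > threshold
    head = prob[mask]
    tail = prob[~mask].sum()
    if tail > 0:
        # Append the combined tail as one extra entry at the end
        head = pd.concat([head, pd.Series({label: tail})])
    return head

# Small made-up sample: 0 and 1 dominate, while 2, 3 and 4 each appear once
demo = lump_tail([0, 0, 0, 0, 1, 1, 1, 2, 3, 4], threshold=0.15)
print(demo)
```

The result can be plotted the same way as before, for example with demo.plot(kind='bar', rot=25).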

There is a limit to the number of category labels you can sensibly display on a bar graph. For a normal-sized graph, 3000 is far too many. Moreover, it is probably not reasonable to expect an audience to glean any meaning from reading 3000 labels.

The graph should summarize the data, and the main point seems to be that 4 or 5% of the categories constitute the vast majority of the cases. To drive home that point, perhaps use pd.qcut to sort the categories into simple classes such as bottom 25%, mid 70%, and top 5%:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

N = 18000
categories = np.arange(N)
np.random.shuffle(categories)
# About 4% of the categories get large counts; the rest get small ones
M = int(N*0.04)
prob = pd.Series(np.concatenate([np.random.randint(9000, 11000, size=M),
                      np.random.randint(0, 100, size=N-M), ]), index=categories)
prob /= prob.sum()
# Bin the categories by quantile of their probability
category_classes = pd.qcut(prob, q=[0, .25, 0.95, 1.],
                 labels=['bottom 25%', 'mid 70%', 'top 5%'])
# Total probability mass that falls in each bin
prob_groups = prob.groupby(category_classes).sum()
prob_groups.plot(kind='bar', rot=0)
plt.show()
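To put a single number on the same point, you could also compute what share of all cases the largest few percent of categories hold. The sketch below assumes simulated counts like the ones above; the helper name share_of_top is invented for this example, and the seeded generator is only there to make the run repeatable.

```python
import numpy as np
import pandas as pd

def share_of_top(counts, top_fraction=0.05):
    """Fraction of the total held by the largest `top_fraction` of categories.

    Hypothetical helper for illustration; not a pandas function.
    """
    ordered = counts.sort_values(ascending=False)
    k = max(1, int(round(len(ordered) * top_fraction)))
    return ordered.iloc[:k].sum() / ordered.sum()

# Simulated data in the same spirit as the example above (assumed, not real data)
rng = np.random.default_rng(0)
N = 18000
M = int(N * 0.04)
counts = pd.Series(np.concatenate([rng.integers(9000, 11000, size=M),
                                   rng.integers(0, 100, size=N - M)]))

top_share = share_of_top(counts, 0.05)
print(f"Top 5% of categories hold {top_share:.1%} of all cases")
```

A figure like this can go in the plot title or caption, so the reader does not have to estimate it from the bar heights.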
