lhk

Reputation: 30256

pandas fill in 0 for non-existing categories in value_counts()

problem: I'm grouping results in my DataFrame, computing value_counts(normalize=True) per group, and plotting the result as a barplot.

The barplot should show relative frequencies. In some groups, some values don't occur. In that case, the corresponding value count is not 0; the row simply doesn't exist. The barplot therefore averages only over the groups where the value does occur, and the resulting bar is too high.

example: Here is a minimal example that illustrates the problem. Let's say the DataFrame contains observations for experiments. When you perform such an experiment, a series of observations is collected. The result of the experiment is the set of relative frequencies of the observations collected for it.

import pandas as pd

df = pd.DataFrame()

df["id"] = [1]*3 + [2]*3 + [3]*3
df["experiment"] = ["a"]*6 + ["b"]*3
df["observation"] = ["positive"]*3 + ["positive"]*2 + ["negative"]*1 + ["positive"]*2 + ["negative"]*1

dataframe

So here, experiment a has been done 2 times, experiment b just once.

I need to group by id and experiment, then average the result.

# rename the Series itself; its default name differs between pandas versions
counts = df.groupby(["id", "experiment"])["observation"].value_counts(normalize=True)
plot_frame = counts.rename("percentage").to_frame()

plot_frame

In the picture above, you can already see the problem. The evaluation with id 1 has seen only positive observations. The relative frequency of "negative" should be 0. Instead, it doesn't exist. If I plot this, the corresponding bar is too high, the blue bars should add up to one:

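import matplotlib.pyplot as plt
import seaborn as sns

# barplot averages "percentage" over the rows that exist for each bar
sns.barplot(data=plot_frame.reset_index(),
            x="observation",
            hue="experiment",
            y="percentage")

plt.show()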

barplot
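To make the size of the error concrete, here is a small sketch that reproduces the averaging the barplot performs (seaborn's default estimator is the mean) with a plain groupby. The name bar_heights is only illustrative. For experiment a, "negative" is averaged over one row instead of two, so it comes out as 1/3 rather than 1/6, and the bars for a add up to more than one.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1]*3 + [2]*3 + [3]*3,
    "experiment": ["a"]*6 + ["b"]*3,
    "observation": ["positive"]*3 + ["positive"]*2 + ["negative"]
                   + ["positive"]*2 + ["negative"],
})

# Relative frequencies per (id, experiment); missing categories have no row.
freq = (
    df.groupby(["id", "experiment"])["observation"]
    .value_counts(normalize=True)
    .rename("percentage")
)
plot_frame = freq.reset_index()

# Mean per (experiment, observation): this is what each bar shows.
bar_heights = plot_frame.groupby(["experiment", "observation"])["percentage"].mean()

print(bar_heights)
print("sum for experiment a:", bar_heights.loc["a"].sum())
```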

Upvotes: 4

Views: 1676

Answers (2)

mbh86

Reputation: 6388

You can add the missing rows, filled with 0, by calling unstack with the argument fill_value=0 and then stacking the result again. Try this:

df.groupby(["id", "experiment"])["observation"].value_counts(normalize=True).unstack(fill_value=0).stack()
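As a sketch of how this fits into the question's example (the DataFrame is rebuilt from the question, and the column name "percentage" is taken from there as well), the stacked result can be renamed and turned back into a frame for plotting. The check at the end confirms that each (id, experiment) group now has an entry for every observation and that its frequencies add up to one.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1]*3 + [2]*3 + [3]*3,
    "experiment": ["a"]*6 + ["b"]*3,
    "observation": ["positive"]*3 + ["positive"]*2 + ["negative"]
                   + ["positive"]*2 + ["negative"],
})

freq = (
    df.groupby(["id", "experiment"])["observation"]
    .value_counts(normalize=True)
    .unstack(fill_value=0)   # one column per observation, gaps become 0
    .stack()                 # back to one row per (id, experiment, observation)
    .rename("percentage")
)

# Long-format frame with columns id, experiment, observation, percentage.
plot_frame = freq.reset_index()

# Each group should now sum to 1.
group_sums = plot_frame.groupby(["id", "experiment"])["percentage"].sum()
print(plot_frame)
print(group_sums)
```

Because unstack only creates rows for the (id, experiment) pairs that actually occur, no artificial groups such as (1, "b") are introduced.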

Upvotes: 5

lhk

Reputation: 30256

I have found a hacky solution, by iterating over the index and manually filling in the missing values:

for a,b,_ in plot_frame.index:
    if (a,b,"negative") not in plot_frame.index:
        plot_frame.loc[(a,b,"negative"), "percentage"] = 0

Now this produces the desired plot:

barplot

I don't particularly like this solution, since it is very specific to my index and probably doesn't scale well if the categories become more complex.
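A less index-specific variant of the same idea, given here only as a sketch: build the full index from the (id, experiment) pairs that actually occur and every observed category, then reindex with fill_value=0. The names pairs, categories and full_index are illustrative and not part of the original code.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1]*3 + [2]*3 + [3]*3,
    "experiment": ["a"]*6 + ["b"]*3,
    "observation": ["positive"]*3 + ["positive"]*2 + ["negative"]
                   + ["positive"]*2 + ["negative"],
})

freq = (
    df.groupby(["id", "experiment"])["observation"]
    .value_counts(normalize=True)
    .rename("percentage")
)

# (id, experiment) pairs that exist in the data, without the observation level.
pairs = freq.index.droplevel(-1).unique()
# Every category that occurs anywhere in the data.
categories = df["observation"].unique()

# Combine each existing pair with every category.
full_index = pd.MultiIndex.from_tuples(
    [(i, e, o) for (i, e) in pairs for o in categories],
    names=["id", "experiment", "observation"],
)

# Missing combinations are added with a frequency of 0.
plot_frame = freq.reindex(full_index, fill_value=0).to_frame()
print(plot_frame)
```

This handles any number of categories without naming them explicitly, at the cost of building the index by hand.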

Upvotes: 0
