Reputation: 30256
problem: I'm grouping results in my DataFrame, computing value_counts(normalize=True) for each group,
and trying to plot the result in a barplot.
The barplot should show relative frequencies, but in some groups some values don't occur. In that case, the corresponding value count
is not 0; the row simply doesn't exist. For the barplot, this 0 value is not taken into account, so the resulting bar is too high.
example: Here is a minimal example that illustrates the problem. Let's say the DataFrame contains observations from experiments. Each time an experiment is performed, a series of observations is collected. The result of the experiment is the set of relative frequencies of the observations collected for it.
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.DataFrame()
df["id"] = [1]*3 + [2]*3 + [3]*3
df["experiment"] = ["a"]*6 + ["b"] * 3
df["observation"] = ["positive"]*3 + ["positive"]*2 + ["negative"]*1 + ["positive"]*2 + ["negative"]*1
So here, experiment a has been done 2 times, experiment b just once.
I need to group by id and experiment, then average the result.
plot_frame = df.groupby(["id", "experiment"])["observation"].value_counts(normalize=True)
# Name the values explicitly; the default name of this column depends on the pandas version.
plot_frame = plot_frame.rename("percentage").to_frame()
In the resulting frame, you can already see the problem. The evaluation with id 1 has seen only positive observations, so the relative frequency of "negative" should be 0. Instead, that row doesn't exist. If I plot this, the corresponding bar is too high; the blue bars should add up to one:
sns.barplot(data=plot_frame.reset_index(),
            x="observation",
            hue="experiment",
            y="percentage")
plt.show()
Upvotes: 4
Views: 1676
Reputation: 6388
You can add rows filled with 0 by chaining unstack and stack, passing fill_value=0 to unstack. Try this:
df.groupby(["id", "experiment"])["observation"].value_counts(normalize=True).unstack(fill_value=0).stack()
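To check that this fills the gaps, here is a small end-to-end sketch using the data from the question. The names freqs and mean_by_experiment are only illustrative, and the rename to "percentage" is an assumption made to match the column the question plots.

```python
import pandas as pd

# Same data as in the question.
df = pd.DataFrame({
    "id": [1] * 3 + [2] * 3 + [3] * 3,
    "experiment": ["a"] * 6 + ["b"] * 3,
    "observation": ["positive"] * 5 + ["negative"] + ["positive"] * 2 + ["negative"],
})

freqs = (
    df.groupby(["id", "experiment"])["observation"]
    .value_counts(normalize=True)
    .unstack(fill_value=0)  # one column per observation value; missing ones become 0
    .stack()                # back to a long Series indexed by (id, experiment, observation)
    .rename("percentage")   # illustrative name, matching the question's plot
)

# Long-form frame that can be handed to sns.barplot as in the question.
plot_frame = freqs.reset_index()

# The averages the barplot would display for each experiment.
mean_by_experiment = freqs.groupby(level=["experiment", "observation"]).mean()
print(mean_by_experiment)
```

Because only the last index level is unstacked, zeros are added only for (id, experiment) pairs that actually exist, so the averages per experiment add up to one.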
Upvotes: 5
Reputation: 30256
I have found a hacky solution, by iterating over the index and manually filling in the missing values:
for a, b, _ in plot_frame.index:
    # Add a zero row for every (id, experiment) group that has no "negative" entry.
    if (a, b, "negative") not in plot_frame.index:
        plot_frame.loc[(a, b, "negative"), "percentage"] = 0
Now this produces the desired plot:
I don't particularly like this solution, since it is very specific to my index and probably doesn't scale well if the categories become more complex.
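One way the loop could be generalized, as a rough sketch rather than a tested recipe: build the full index from the (id, experiment) groups that exist and every observation value seen anywhere, then reindex with fill_value=0. This swaps the loop for DataFrame.reindex; the names groups, categories and full_index are made up for illustration.

```python
import pandas as pd

# Same data as in the question.
df = pd.DataFrame({
    "id": [1] * 3 + [2] * 3 + [3] * 3,
    "experiment": ["a"] * 6 + ["b"] * 3,
    "observation": ["positive"] * 5 + ["negative"] + ["positive"] * 2 + ["negative"],
})

plot_frame = (
    df.groupby(["id", "experiment"])["observation"]
    .value_counts(normalize=True)
    .rename("percentage")
    .to_frame()
)

# Groups that actually occur, and all observation values seen in any group.
groups = plot_frame.index.droplevel("observation").unique()
categories = plot_frame.index.get_level_values("observation").unique()

# Pair every existing group with every category (no new groups are invented).
full_index = pd.MultiIndex.from_tuples(
    [(i, e, c) for (i, e) in groups for c in categories],
    names=plot_frame.index.names,
)

# Rows missing from the original frame are created with percentage 0.
plot_frame = plot_frame.reindex(full_index, fill_value=0)
```

This no longer hard-codes "negative", and it avoids creating rows for combinations such as id 3 with experiment "a" that never occurred.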
Upvotes: 0