user3489010
Reputation: 11

In Altair/Vega-Lite, how do I show the percentage within each grouped category instead of the total?

I'm new to Altair/Vega-Lite and struggling a bit to "get" the transformation, calculation, and encoding way of thinking, especially for more complex/nested data.

Specifically, I am trying to create a very simple layered histogram that shows the salary distribution of different countries.

So far I was able to get on the Y Axis the percentage of occurrences compared to the total:


import altair as alt
import numpy as np
import pandas as pd

salaries = {
    'NL': np.random.normal(loc=80000, scale=30000, size=(500,)),
    'ES': np.random.normal(loc=80000, scale=30000, size=(50,))
}
source = pd.DataFrame({k:pd.Series(v) for k,v in salaries.items()})

c = alt.Chart(source).transform_fold(
   ['NL', 'ES'],
   as_=['Benchmark', 'Salaries']
   ).transform_joinaggregate(
       total='count(*)',
       groupby=['Benchmark']
   ).transform_calculate(
       pct='1/ datum.total'
   ).mark_bar(opacity=0.3, binSpacing=0
   ).encode(
       alt.Color('Benchmark:N'),
       x=alt.X('Salaries:Q', bin=alt.Bin(maxbins=20)),
       y=alt.Y('sum(pct):Q', axis=alt.Axis(format='%'), stack=None)
   )

which results in:

[Figure: total percentage]

However, I'd like the percentage to be computed within each category instead of against the total. So, in this example, the second distribution should show percentages on the Y axis at the same level as the first one, since both are drawn from identical normal distributions.

I hope it's clear enough, apologies for probably lacking the statistical theory and glossary to explain things better.

Upvotes: 1

Views: 668

Answers (1)

debbes

Reputation: 942

It is grouped per category, but the problem here is that your 'ES' column has 450 NaN values, which I guess are still counted by count(), so the percentage for the actual values is very low. One way to solve this is to use alt.Chart(source.dropna()), which yields the plot below.

[Figure: plot]
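To see where the 450 NaN values come from, here is a minimal sketch using only pandas and NumPy. Note that source.dropna() removes every row containing a NaN, so it also discards 450 of the 500 'NL' values. As an alternative (my assumption, not something the answer proposes), the sketch reshapes the data to long form, drops only the missing salaries, and precomputes the per-category weight in pandas. The names long_form, n_missing, and totals are illustrative, and the seeded generator is only there to make the run reproducible.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only for reproducibility
salaries = {
    "NL": rng.normal(loc=80000, scale=30000, size=500),
    "ES": rng.normal(loc=80000, scale=30000, size=50),
}
# Wide form: the shorter 'ES' column is padded with NaN up to 500 rows.
source = pd.DataFrame({k: pd.Series(v) for k, v in salaries.items()})
n_missing = int(source["ES"].isna().sum())

# Long form: one row per (Benchmark, Salaries) pair, missing salaries dropped.
long_form = (
    source.melt(var_name="Benchmark", value_name="Salaries")
    .dropna(subset=["Salaries"])
    .reset_index(drop=True)
)

# Each row weighs 1 / (number of real values in its own category).
group_size = long_form.groupby("Benchmark")["Salaries"].transform("count")
long_form["pct"] = 1 / group_size

# Sanity check: the weights in each category should add up to 1.
totals = long_form.groupby("Benchmark")["pct"].sum()
print(n_missing, len(long_form))
print(totals)
```

With the data in this shape, long_form could presumably be passed to alt.Chart with the same mark_bar and encode calls as in the question, leaving out the fold, joinaggregate, and calculate transforms, since pct is already a column.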

Upvotes: 2
