Reputation: 11
Newbie using Altair/Vega-Lite here, struggling a bit to "get" the transformation, calculation and encoding way of thinking, especially for more complex/nested data.
Specifically, I am trying to create a super simple layered histogram that shows the salary distribution of different countries.
So far I have been able to get the percentage of occurrences relative to the total on the Y axis:
import altair as alt
import numpy as np
import pandas as pd

# Two samples of different sizes drawn from the same normal distribution
salaries = {
    'NL': np.random.normal(loc=80000, scale=30000, size=(500,)),
    'ES': np.random.normal(loc=80000, scale=30000, size=(50,))
}
# Wide format: one column per country; the shorter column is padded with NaN
source = pd.DataFrame({k: pd.Series(v) for k, v in salaries.items()})
c = alt.Chart(source).transform_fold(
    ['NL', 'ES'],
    as_=['Benchmark', 'Salaries']
).transform_joinaggregate(
    total='count(*)',
    groupby=['Benchmark']
).transform_calculate(
    pct='1 / datum.total'
).mark_bar(opacity=0.3, binSpacing=0
).encode(
    alt.Color('Benchmark:N'),
    x=alt.X('Salaries:Q', bin=alt.Bin(maxbins=20)),
    y=alt.Y('sum(pct):Q', axis=alt.Axis(format='%'), stack=None)
)
which results in:
However, I'd like the percentage to be computed within each category instead of against the total. So, in this example, the second distribution should show percentages on the Y axis at the same level as the first one, as they are identical normal distributions.
I hope it's clear enough, apologies for probably lacking the statistical theory and glossary to explain things better.
Upvotes: 1
Views: 668
Reputation: 942
It is grouped per category. The problem here is that your 'ES' column has 450 NaN values, which I guess are still counted by count(*), so the percentage for the actual values comes out very low.
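To see where the extra rows come from, here is a quick pandas-only check that rebuilds the question's data. It is just a sketch: the seeded generator `rng` and the variable names `rows_per_column` and `non_null` are my own additions, not something from the question.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed added here only for reproducibility
salaries = {
    'NL': rng.normal(loc=80000, scale=30000, size=500),
    'ES': rng.normal(loc=80000, scale=30000, size=50),
}
# pd.Series aligns on the index, so the shorter 'ES' column is padded with NaN
source = pd.DataFrame({k: pd.Series(v) for k, v in salaries.items()})

rows_per_column = len(source)   # rows each column contributes once folded
non_null = source.count()       # actual (non-NaN) values per column

print(rows_per_column)          # 500
print(non_null.to_dict())       # {'NL': 500, 'ES': 50}
print(1 / rows_per_column)      # weight each 'ES' row gets now: 0.002
print(1 / non_null['ES'])       # weight it would need to sum to 100%: 0.02
```

With 50 real values weighted at 1/500 each, the 'ES' bars can only add up to about 10%, which matches the flattened second histogram.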
One way to solve this is to use alt.Chart(source.dropna()), which would yield the plot below.
Upvotes: 2