Reputation: 1170
This is probably more of a data-processing problem than an Altair-specific one. I have some survey data where respondents chose an age range rather than their actual age, and I'm trying to make a histogram with a median line. The Altair examples with mean lines all seem to do the binning on the fly, and I'm not sure how to work around that.
Since the x-axis is categorical (I think), I can't just add a line somewhere in between. Maybe I need to convert the groups to something numerical?
Here's what I have so far:
import altair as alt
import pandas as pd

sample = ['35-39', '25-29', '30-34', '30-34', '25-29', '30-34', '22-24',
          '50-54', '30-34', '40-44', '22-24', '25-29', '22-24', '50-54',
          '22-24', '35-39', '25-29', '22-24', '22-24', '25-29', '25-29',
          '30-34', '22-24', '40-44', '30-34', '25-29', '30-34', '25-29']
data = pd.DataFrame({'age': sample})
data

bars = alt.Chart(data).mark_bar().encode(
    x=alt.X('age'),
    y='count():Q'
)
mean = alt.Chart(data).mark_rule().encode(
    x='mean(age)',
    size=alt.value(5)
)
bars + mean
That code produces this chart:
Upvotes: 1
Views: 754
Reputation: 86328
If you want to compute the mean of the x values, you'll need to specify quantitative values: computing the mean of strings, even if those strings happen to include digits, is not well defined. For your data, you could use a Calculate Transform to do something like this:
import altair as alt
import pandas as pd

sample = ['35-39', '25-29', '30-34', '30-34', '25-29', '30-34', '22-24',
          '50-54', '30-34', '40-44', '22-24', '25-29', '22-24', '50-54',
          '22-24', '35-39', '25-29', '22-24', '22-24', '25-29', '25-29',
          '30-34', '22-24', '40-44', '30-34', '25-29', '30-34', '25-29']
data = pd.DataFrame({'age': sample})

# Derive numeric bin edges and a midpoint from each "lo-hi" label.
# Adding 1 to the upper bound makes adjacent bins share an edge (e.g. 25-30, 30-35).
base = alt.Chart(data).transform_calculate(
    age_min='parseInt(split(datum.age, "-")[0])',
    age_max='parseInt(split(datum.age, "-")[1]) + 1',
    age_mid='(datum.age_min + datum.age_max) / 2',
)

# bin='binned' tells Altair the data is already binned, with x/x2 as the bin edges.
bars = base.mark_bar().encode(
    x=alt.X('age_min:Q', bin='binned'),
    x2='age_max:Q',
    y='count():Q'
)

# Vertical rule at the mean of the bin midpoints.
mean = base.mark_rule(size=5).encode(
    x='mean(age_mid):Q',
)

bars + mean
Note that this mean is only an approximation: binned data does not contain enough information to compute the actual mean age. The mean of the bin midpoints is a reasonable estimate of the true value, provided ages are spread roughly evenly within each bin.
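If you want to sanity-check where the rule lands, the same midpoint arithmetic can be sketched in pandas. This is only an illustration of the calculation above, not part of the chart. The column names mirror the ones in the calculate transform, and `mean_estimate` and `median_estimate` are names introduced here for the example.

```python
import pandas as pd

sample = ['35-39', '25-29', '30-34', '30-34', '25-29', '30-34', '22-24',
          '50-54', '30-34', '40-44', '22-24', '25-29', '22-24', '50-54',
          '22-24', '35-39', '25-29', '22-24', '22-24', '25-29', '25-29',
          '30-34', '22-24', '40-44', '30-34', '25-29', '30-34', '25-29']
data = pd.DataFrame({'age': sample})

# Split each "lo-hi" label into integer bounds.
bounds = data['age'].str.split('-', expand=True).astype(int)
data['age_min'] = bounds[0]
# +1 so the upper edge is exclusive, matching age_max in the chart code.
data['age_max'] = bounds[1] + 1
data['age_mid'] = (data['age_min'] + data['age_max']) / 2

# Estimates based on bin midpoints only; the true ages are unknown.
mean_estimate = data['age_mid'].mean()
median_estimate = data['age_mid'].median()
print(mean_estimate, median_estimate)
```

Since the question asks about a median line, swapping `mean(age_mid)` for `median(age_mid)` in the rule encoding should give the midpoint of the bin that contains the median respondent. That is an equally coarse estimate, since it can only ever land on a bin midpoint.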
Upvotes: 1