How to plot a pre-binned histogram with median line in Altair?

Question

This is probably more of a data-processing problem than an Altair-specific one, but I have some survey data where respondents chose an age range rather than their actual age, and I'm trying to make a histogram with a median line. The Altair examples with mean lines all seem to do the binning on the fly, and I'm not sure how to work around that.

Since the x-axis is categorical (I think), I can't just add a line somewhere in between. Maybe I need to convert the groups to something numerical?

Here's what I have so far:

import altair as alt
import pandas as pd

sample = ['35-39', '25-29', '30-34', '30-34', '25-29', '30-34', '22-24',
          '50-54', '30-34', '40-44', '22-24', '25-29', '22-24', '50-54',
          '22-24', '35-39', '25-29', '22-24', '22-24', '25-29', '25-29',
          '30-34', '22-24', '40-44', '30-34', '25-29', '30-34', '25-29']
data = pd.DataFrame({'age': sample})
data

bars = alt.Chart(data).mark_bar().encode(
    x=alt.X('age'),
    y='count():Q'
)

mean = alt.Chart(data).mark_rule().encode(
    x = 'mean(age)',
    size=alt.value(5)
)

bars+mean

That code produces this chart: (image not included)

jakevdp · Accepted Answer

If you want to compute the mean of the x values, you'll need to specify quantitative values: computing the mean of strings, even if those strings happen to include digits, is not well defined. For your data, you could use a Calculate Transform to do something like this:

import altair as alt
import pandas as pd
sample = ['35-39', '25-29', '30-34', '30-34', '25-29', '30-34', '22-24',
          '50-54', '30-34', '40-44', '22-24', '25-29', '22-24', '50-54',
          '22-24', '35-39', '25-29', '22-24', '22-24', '25-29', '25-29',
          '30-34', '22-24', '40-44', '30-34', '25-29', '30-34', '25-29']
data = pd.DataFrame({'age': sample})

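base = alt.Chart(data).transform_calculate(
    # Lower bin edge: the number before the hyphen.
    age_min='parseInt(split(datum.age, "-")[0])',
    # Upper bin edge: the number after the hyphen, plus 1 so neighbouring bins share an edge.
    age_max='parseInt(split(datum.age, "-")[1]) + 1',
    # Bin midpoint, used as the stand-in age for each respondent.
    age_mid='(datum.age_min + datum.age_max) / 2',
)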

bars = base.mark_bar().encode(
    x=alt.X('age_min:Q', bin='binned'),
    x2='age_max:Q',
    y='count():Q'
)

mean = base.mark_rule(size=5).encode(
    x='mean(age_mid):Q',
)

bars+mean

Note that this mean is only an approximation: binned data does not contain enough information to compute the actual mean age, so the mean of the bin midpoints is the best available estimate of the true value.
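Since the question title asks for a median, the same midpoint idea can be checked outside the chart. The sketch below uses only the Python standard library and the sample data from the question. The helper names `bin_edges` and `interpolated_median` are hypothetical, introduced here for illustration, and the interpolated estimate assumes ages are spread evenly within each bin, which the survey data cannot confirm.

```python
from collections import Counter
from statistics import mean, median

sample = ['35-39', '25-29', '30-34', '30-34', '25-29', '30-34', '22-24',
          '50-54', '30-34', '40-44', '22-24', '25-29', '22-24', '50-54',
          '22-24', '35-39', '25-29', '22-24', '22-24', '25-29', '25-29',
          '30-34', '22-24', '40-44', '30-34', '25-29', '30-34', '25-29']


def bin_edges(label):
    """Split a label such as '35-39' into (lower, upper) bin edges."""
    low, high = (int(part) for part in label.split("-"))
    return low, high + 1  # same +1 convention as age_max in the answer


# One midpoint per respondent, mirroring age_mid in the calculate transform.
midpoints = [sum(bin_edges(label)) / 2 for label in sample]
mean_estimate = mean(midpoints)
median_estimate = median(midpoints)


def interpolated_median(labels):
    """Estimate the median by linear interpolation inside the median bin.

    Assumption: respondents are spread evenly across each bin.
    """
    counts = Counter(labels)
    half = len(labels) / 2
    cumulative = 0
    for label in sorted(counts, key=bin_edges):
        low, high = bin_edges(label)
        if cumulative + counts[label] >= half:
            fraction = (half - cumulative) / counts[label]
            return low + fraction * (high - low)
        cumulative += counts[label]


print(mean_estimate, median_estimate, interpolated_median(sample))
```

On the Altair side, replacing 'mean(age_mid):Q' with 'median(age_mid):Q' in the rule's encoding should draw the median-of-midpoints line, since median is one of the Vega-Lite aggregate operations. That median can only ever land on a bin midpoint. If a position inside the bin is wanted, one option is to compute the interpolated value in Python first and pass it to a rule as a constant.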
