jar

Reputation: 2908

Sorting a stacked bar chart as a whole along Y-axis, based on the value of a particular category field using Altair

Using an example from the docs, I can sort the segments within each stacked bar using order, but I want the bars themselves to be sorted along the Y-axis by the summed yield for site == Crookston (i.e. the blue segment), in ascending or descending order.

Based on this post I tried using transform_calculate and transform_joinaggregate, but it doesn't work as expected.

import altair as alt
from vega_datasets import data

source = data.barley()

alt.Chart(source).mark_bar().transform_calculate(
    key="datum.site == 'Crookston'"
).transform_joinaggregate(
    sort_key="argmax(key)", groupby=['variety']
).transform_calculate(
    sort_val='datum.sort_key.value'  
).encode(
    x=alt.X('sum(yield)', stack='normalize'),
    y=alt.Y('variety', sort=alt.SortField('sort_val', order='ascending')),
    color='site',
    order=alt.Order(
      # Sort the segments of the bars by this field
      'site',
      sort='ascending'
    )
)

Expected Output
The bars along the Y-axis are sorted by the size of the blue (site == Crookston) segment.

Upvotes: 1

Views: 616

Answers (2)

Sky

Reputation: 25

As demonstrated in jakevdp's answer, you can start by using a Calculate transform to define a new field that copies the yield when the site is "Crookston" and is 0 otherwise. From there, a Join Aggregate transform is not necessary: when SortField is used in the y-axis sort, the field is summed within each variety by default, which gives the total Crookston yield directly.

import altair as alt
from vega_datasets import data

source = data.barley()

alt.Chart(source).mark_bar().transform_calculate(
    filtered="datum.site == 'Crookston' ? datum.yield : 0"
).encode(
    x=alt.X("sum(yield)"),
    y=alt.Y(
        "variety",
        sort=alt.SortField("filtered", order="ascending"),
    ),
    color="site",
    order=alt.Order(
        # Sort the segments of the bars by this field
        "site",
        sort="ascending",
    ),
)

Result

In fact, the method in jakevdp's answer using transform_joinaggregate does not work as expected in general; it only works in this example because the source dataset has exactly the same number of records for each variety. For instance, if you add a record of the "Manchuria" variety with a yield of 0 at the Crookston site, that method now sorts Manchuria two places farther down on the y-axis, below Velvet and No. 475, and above No. 462.

import pandas as pd

source = data.barley()
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job.
extra = pd.DataFrame([{"yield": 0, "variety": "Manchuria", "site": "Crookston"}])
source = pd.concat([source, extra], ignore_index=True)

alt.Chart(source).mark_bar().transform_calculate(
    filtered="datum.site == 'Crookston' ? datum.yield : 0"
).transform_joinaggregate(
    sort_val="sum(filtered)", groupby=["variety"]
).encode(
    x=alt.X("sum(yield)"),
    y=alt.Y("variety", sort=alt.SortField("sort_val", order="ascending")),
    color="site",
    order=alt.Order("site", sort="ascending"),
)

Result

It's visually apparent that the chart is no longer sorted as desired. Adding a yield of zero should not have affected the sort order; the Manchuria variety still has a smaller yield in Crookston than the Velvet and No. 475 varieties.

To see what went wrong, you can open the chart produced by the second code block in Vega Editor. There you will find a table called "data_0" with entries including the following (not in this order):

yield     variety      year  site         filtered  sort_val
39.93333  "Manchuria"  1931  "Crookston"  39.93333  72.9
32.96667  "Manchuria"  1932  "Crookston"  32.96667  72.9
0         "Manchuria"  null  "Crookston"  0         72.9
22.56667  "Manchuria"  1932  "Duluth"     0         72.9
41.33333  "Velvet"     1931  "Crookston"  41.33333  73.39999
32.06666  "Velvet"     1932  "Crookston"  32.06666  73.39999
22.46667  "Velvet"     1932  "Duluth"     0         73.39999
48.56666  "No. 462"    1931  "Crookston"  48.56666  79.09999

The sort_val for Manchuria, 72.9, is less than the one for Velvet, as it should be. However, Vega still needs to decide how to aggregate the duplicate values of sort_val that appear in every row for a given variety. The default behavior for stacked plots is to sum all of the entries in the sort field across the group being sorted (see: https://vega.github.io/vega-lite/docs/sort.html#sort-by-a-different-field), a fact that came in handy in the first code block.

The source data set had 12 entries for each variety to begin. After adding a record, there are now 13 entries of the Manchuria variety, so Manchuria gets a sort value of 72.9 · 13 = 947.7, which is larger than Velvet's sort value of 73.39999 · 12 ≈ 880.8, but still smaller than variety No. 462's sort value of 79.09999 · 12 ≈ 949.2. This reflects what was seen in the second chart.
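The arithmetic above can be checked with a few lines of plain Python. This is only a sketch of the default sum aggregation, not Vega's actual implementation; the per-row sort_val values and row counts are the ones quoted in the text.

```python
# (sort_val repeated on every row, number of rows) per variety, from the text above.
groups = {
    "Manchuria": (72.9, 13),     # 12 original rows + 1 appended record
    "Velvet": (73.39999, 12),
    "No. 462": (79.09999, 12),
}

# Summing a value that is duplicated on every row multiplies it by the row count.
summed = {name: value * count for name, (value, count) in groups.items()}

# Ascending order actually used for sorting (by the summed value).
order_by_sum = sorted(summed, key=summed.get)

# Ascending order that was intended (by the per-variety Crookston total).
order_by_value = sorted(groups, key=lambda name: groups[name][0])

print(order_by_sum)    # Manchuria lands between Velvet and No. 462
print(order_by_value)  # Manchuria comes first
```

The two orders differ only because Manchuria has one more row than the others, which is why the problem stays hidden in the unmodified dataset.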

To fix this, you can specify that only a single sort_val should be used as the sorting value for each variety, by using EncodingSortField instead of SortField, and passing "min", "max", or "average" as the aggregation operation to the op parameter, e.g. sort=alt.EncodingSortField("sort_val", op="min", order="ascending"). Or you can use the first method above and skip the Join Aggregate transform.

Upvotes: 0

jakevdp

Reputation: 86300

Each colored bar in your chart represents the sum of all yields within that site and variety, for all years in the dataset. When you use argmax, you are sorting by a single year's Crookston yield, not the total Crookston yield among all years. You can get the latter with a slightly different transform strategy:

import altair as alt
from vega_datasets import data

source = data.barley()

alt.Chart(source).mark_bar().transform_calculate(
    filtered="datum.site == 'Crookston' ? datum.yield : 0"
).transform_joinaggregate(
    sort_val="sum(filtered)", groupby=["variety"]
).encode(
    x=alt.X('sum(yield)', stack='normalize'),
    y=alt.Y('variety', sort=alt.SortField('sort_val', order='ascending')),
    color='site',
    order=alt.Order(
      # Sort the segments of the bars by this field
      'site',
      sort='ascending'
    )
)

The result is correctly sorted by the total yield from Crookston, as you can confirm by removing the normalization in the x encoding:

alt.Chart(source).mark_bar().transform_calculate(
    filtered="datum.site == 'Crookston' ? datum.yield : 0"
).transform_joinaggregate(
    sort_val="sum(filtered)", groupby=["variety"]
).encode(
    x=alt.X('sum(yield)'),
    y=alt.Y('variety', sort=alt.SortField('sort_val', order='ascending')),
    color='site',
    order=alt.Order(
      'site',
      sort='ascending'
    )
)

Upvotes: 1
