az5112
az5112

Reputation: 642

How to group a dataset into top-K plus others in Vega-Lite

I would like to create a stacked graph bar similar to https://vega.github.io/vega-lite/examples/stacked_bar_weather.html

My dataset contains

The number of countries is significant but I only want to show the top 5 and group the rest under the label OTHER. I want to select the top 5 over the complete time period of one year (not the top 5 per each individual day).

I am able to select the top-5 with the following top-5 transform:

  "transform" : [
    {
      "aggregate" : [ {"op": "sum", "field": "hit", "as": "hit_sum_by_country"} ],
      "groupby" : ["country"]
    },
    {
      "window": [ {"op": "row_number", "as": "rank"} ],
      "sort": [ {"field": "hit_sum_by_country", "order": "descending"} ]
    },
    {
      "calculate": "datum.rank < 5 ? datum.country : 'others'", "as" : "top_country"
    }
  ],

This, however, munges my original data.

So it looks like I would need to do the top-5 transform on the side and then somehow inject "top_country" to the original dataset. Is this possible? Or should I be doing this outside vega (i.e. in javascript or server-side)

UPDATE: an example in the vega editor. The chart is hard to read because of the number of countries that are color-coded. I would like to select top-5 and then group (sum) all others under one label/color OTHER.

UPDATE-2 another example linked from the comment

Upvotes: 1

Views: 556

Answers (1)

jakevdp
jakevdp

Reputation: 86300

I think you can accomplish what you want to do with the following transform:

  "transform": [
    {
      "joinaggregate": [{"op": "sum", "field": "hits", "as": "hit_sum_by_country"}],
      "groupby": ["country"]
    },
    {
      "window": [{"op": "row_number", "as": "rank"}],
      "sort": [{"field": "hit_sum_by_country", "order": "descending"}],
      "groupby": ["date"]
    },
    {"calculate": "datum.rank <= 5 ? datum.country : 'Other'", "as": "country"}
  ],

Note: this pre-supposes that each country has an entry at each date; if not, you'll also have to add an imputation transform before the window transform.

Here's the result (view in editor):

enter image description here

Edit: here's what the transform might look like if you need to impute missing values:


  "transform": [
    {"impute": "hits", "key": "country", "value": 0, "groupby": ["date"]},
    {
      "joinaggregate": [
        {"op": "sum", "field": "hits", "as": "hit_sum_by_country"}
      ],
      "groupby": ["country"]
    },
    {
      "window": [{"op": "row_number", "as": "rank"}],
      "sort": [{"field": "hit_sum_by_country", "order": "descending"}],
      "groupby": ["date"]
    },
    {
      "calculate": "datum.rank <= 5 ? datum.country : 'Other'",
      "as": "top_country"
    }
  ],

The result looks something like this, using the data from your updated example (view in editor):

enter image description here

Upvotes: 1

Related Questions