Kaizah Kaiser

Reputation: 15

Crossfilter on aggregated results

I am using Crossfilter (with DC.js, and hence D3) to visualize large volumes of data. I like the interactive nature of the library, but my data is quickly becoming far too large. The best way I can see to deal with this is to pre-aggregate the data when it gets too large. I am having difficulty finding out how (and whether) Crossfilter can work with this sort of data.

To illustrate, the data I have is of the form

[
    {"date":"01-01-2016","food": "apple", "gender": "M", "country": "DE"},
    {"date":"01-01-2016","food": "pear", "gender": "M", "country": "DE"},
    {"date":"01-01-2016","food": "apple", "gender": "F", "country": "DE"},
    {"date":"01-01-2016","food": "apple", "gender": "F", "country": "UK"},
    {"date":"01-02-2016","food": "pear", "gender": "M", "country": "UK"},
    {"date":"01-02-2016","food": "pear", "gender": "M", "country": "UK"},
    {"date":"01-02-2016","food": "apple", "gender": "M", "country": "US"},
    ...
]

How would I go about visualizing this, pivoting around the date field? For example, I want to know that on 01-01 I had 3 people buying apples: 2 from DE (1 male, 1 female) and 1 from UK.

I figured I could do this by computing a sort of data cube, with one row for each combination of values plus a count, like so:

[
    {"date":"01-01-2016","food": "apple", "gender": "M", "country": "DE", "count": 100000},
    {"date":"01-01-2016","food": "pear", "gender": "M", "country": "DE", "count": 72651},
    {"date":"01-01-2016","food": "apple", "gender": "F", "country": "DE", "count": 12345},
    {"date":"01-01-2016","food": "apple", "gender": "F", "country": "UK", "count": 9287164},
    {"date":"01-02-2016","food": "pear", "gender": "M", "country": "UK", "count": 291732743},
    {"date":"01-02-2016","food": "apple", "gender": "M", "country": "US", "count": 128176376},
    ...
]

But with this setup, I don't gain much in terms of data volume, and I am not sure how (or whether) Crossfilter can handle data represented in this manner.

Upvotes: 1

Views: 305

Answers (1)

Ethan Jewett

Reputation: 6010

This question is pretty broad*, but here goes.

There are several ways to handle this problem in Crossfilter. I'll list them more or less in order of complexity:

  1. Shrink your records by using tokens for keys and values. For example, {"date":"01-01-2016","food": "apple", "gender": "M", "country": "DE"} might become {"d":"01-01-2016","f": "a", "g": "M", "c": "DE"}, which will save you several bytes per record.
  2. Pre-aggregate your records as you have described. For counts, this is pretty easy. Each pre-aggregated record carries a count of the raw records it replaces, and you then call reduceSum(function(d) { return d.count; }) or similar on the Crossfilter group to aggregate a sum of the counts. For other types of aggregations it gets more complicated and may require custom reducers, but generally something will be possible. If you are having trouble with a specific aggregation problem, then create a new question about that and lay out the problem exactly.
  3. Just drive a Crossfilter-based API from the server-side. You'll lose some interactivity, but this is a solid approach. Several solutions are documented here: https://github.com/dc-js/dc.js/wiki/FAQ#how-do-i-replace-crossfilter-with-a-server-side-solution
  4. Use a combined approach where you do some pre-aggregation server-side, but still handle filtering and final aggregations client-side. The only example of doing this with Crossfilter that I'm aware of is here: http://lcadata.info (source-code here: https://github.com/esjewett/lcadata). This is a really data-dependent solution and there is no general-purpose library here.
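To make option 2 above concrete, here is a minimal sketch in plain JavaScript, using the sample records from the question. The helpers preAggregate and sumCountsBy are hypothetical names invented for this illustration and are not part of Crossfilter; sumCountsBy is only a stand-in for what a sum-of-counts group would compute.

```javascript
// Hypothetical helper (not part of Crossfilter): collapse raw records into one
// row per unique combination of `keys`, with a `count` of the rows it replaces.
function preAggregate(records, keys) {
  const buckets = new Map();
  for (const record of records) {
    // The JSON-encoded list of key values serves as a composite bucket id.
    const id = JSON.stringify(keys.map((k) => record[k]));
    let row = buckets.get(id);
    if (!row) {
      row = {};
      for (const k of keys) row[k] = record[k];
      row.count = 0;
      buckets.set(id, row);
    }
    row.count += 1;
  }
  return Array.from(buckets.values());
}

// Hypothetical plain-JS stand-in for a group that sums `count` per key value.
function sumCountsBy(rows, key) {
  const totals = {};
  for (const row of rows) {
    totals[row[key]] = (totals[row[key]] || 0) + row.count;
  }
  return totals;
}

// The seven sample records shown in the question.
const raw = [
  { date: "01-01-2016", food: "apple", gender: "M", country: "DE" },
  { date: "01-01-2016", food: "pear", gender: "M", country: "DE" },
  { date: "01-01-2016", food: "apple", gender: "F", country: "DE" },
  { date: "01-01-2016", food: "apple", gender: "F", country: "UK" },
  { date: "01-02-2016", food: "pear", gender: "M", country: "UK" },
  { date: "01-02-2016", food: "pear", gender: "M", country: "UK" },
  { date: "01-02-2016", food: "apple", gender: "M", country: "US" },
];

const aggregated = preAggregate(raw, ["date", "food", "gender", "country"]);
const perDate = sumCountsBy(aggregated, "date");
```

Here the two identical 01-02-2016 pear records collapse into a single row with a count of 2, so seven raw records become six aggregated rows while the per-date totals stay the same. With Crossfilter itself, the role of sumCountsBy would be played by a group on the date dimension with reduceSum(function(d) { return d.count; }), as described in option 2.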

Orthogonal to all of this is moving Crossfilter to a web worker, which can help with interactivity but does not reduce the data volume itself.

My recommendation: Do #1 above and determine how many records you can support with the level of interactivity that you need. Then, if necessary, implement #2. If that's not enough, decide if #3 is an option and if it is, do that. Otherwise consider #4, but understand that you are undertaking a pretty advanced task and will be blazing your own trail to a large extent.
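As a starting point for #1, the following is a rough sketch of tokenizing one record and measuring how many characters of JSON it saves. The maps KEY_TOKENS and FOOD_TOKENS and the function shrinkRecord are hypothetical names chosen for this example; the short tokens mirror the ones used in the example in #1.

```javascript
// Hypothetical key map; the short names are illustrative only.
const KEY_TOKENS = { date: "d", food: "f", gender: "g", country: "c" };

// Hypothetical value map, applied to the food field only.
const FOOD_TOKENS = { apple: "a", pear: "p" };

// Return a copy of the record with short keys and tokenized food values.
// Keys or values without a token are passed through unchanged.
function shrinkRecord(record) {
  const out = {};
  for (const [key, value] of Object.entries(record)) {
    const shortKey = KEY_TOKENS[key] || key;
    out[shortKey] = key === "food" ? FOOD_TOKENS[value] || value : value;
  }
  return out;
}

const original = { date: "01-01-2016", food: "apple", gender: "M", country: "DE" };
const shrunk = shrinkRecord(original);

// Characters saved in the serialized form of this one record.
const savedChars = JSON.stringify(original).length - JSON.stringify(shrunk).length;
console.log(shrunk, savedChars);
```

The saving per record is small, so whether it is enough depends on how many records you load. Accessor functions for dimensions and groups would then need to read the short keys (for example d.f instead of d.food), and labels would need mapping back to the full values for display.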

*In order to answer any specific question you have, we would need further information, such as exactly how many records of the type you show you are trying to load, which dimensions you actually need, what types of groups you will need to create, etc.

Upvotes: 1
