nicholas a. evans

Reputation: 2233

filter and sort based on an aggregate with Cloudant/CouchDB chained map reduce

I would like to filter a list and sort it based on an aggregate; something that is fairly simple to express in SQL, but I'm puzzled about the best way to do it with iterative Map Reduce. I'm specifically using Cloudant's "dbcopy" addition to CouchDB, but I think the approach might be similar with other map/reduce architectures.

Pseudocode SQL might look like so:

SELECT   grouping_field, aggregate(*)
FROM     data
WHERE    #{filter}
GROUP BY grouping_field
ORDER BY aggregate(*), grouping_field
LIMIT    page_size

The filter might be looking for an exact match or it might be searching within a range; e.g. field in ('foo', 'bar') or field between 37 and 42.

As a concrete example, consider a dataset of emails; the grouping field might be "List-id", "Sender", or "Subject"; the aggregate function might be count(*), or max(date) or min(date); and the filter clause might consider flags, a date range, or a mailbox ID. The documents might look like so:

{
  "id": "foobar", "mailbox": "INBOX", "date": "2013-03-29",
  "sender": "[email protected]", "subject": "Foo Bar"
}

Getting a count of emails with the same sender is trivial:

"map": "function (doc) { emit(doc.sender, null) }",
"reduce": "_count"

And Cloudant has a good example of sorting by count on the second pass of a map reduce. But when I also want to filter (e.g. by mailbox), things get messy fast.

If I add the filter to the view keys (e.g. the final result looks like {"key": ["INBOX", 1234, "[email protected]"], "value": null}), then it's trivial to sort by count within a single filter value. But sorting that data by count across multiple filter values would require traversing the entire data set (once per key), which is far too slow on large data sets.
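
For concreteness, here is roughly what that first option looks like with Cloudant's dbcopy chaining. All database, design doc, and view names are hypothetical, and the second pass assumes dbcopy writes each reduced row into the target database as a document with "key" and "value" fields, as in Cloudant's example.

First pass, on the mail database (dbcopy materializes the reduced rows into "mail_counts"):

{
  "views": {
    "by_mbox_sender": {
      "map": "function (doc) { emit([doc.mailbox, doc.sender], null); }",
      "reduce": "_count",
      "dbcopy": "mail_counts"
    }
  }
}

Second pass, on "mail_counts", re-keyed so the count comes before the sender:

{
  "views": {
    "by_mbox_count": {
      "map": "function (doc) { emit([doc.key[0], doc.value, doc.key[1]], null); }"
    }
  }
}

Querying the second view with startkey=["INBOX"]&endkey=["INBOX", {}] then returns INBOX senders ordered by count (ascending; swap the keys and add descending=true for largest-first).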

Or I could create an index for each potential filter selection; e.g. the final result looks like {"key": [["mbox1", "mbox2"], 1234, "[email protected]"], "value": null} (for when both "mbox1" and "mbox2" are selected) or {"key": [["mbox1"], 1234, "[email protected]"], "value": null} (for when only "mbox1" is selected). That's easy to query, and fast. But the disk size of the index grows exponentially with the number of distinct filter values, since every possible combination of selections needs its own pre-reduced entries. And it seems completely untenable for filtering on open-ended data, such as date ranges.
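
To make that blow-up concrete, the emit-per-selection idea needs something like the following first-pass map, where the full mailbox list is baked into the view. Everything here is hypothetical, and it is the approach being questioned, not a recommendation:

// Emits one pre-aggregated row for every subset of mailboxes that contains
// this document's mailbox, so each possible filter selection gets its own
// pre-reduced index. KNOWN_MAILBOXES must list every mailbox up front.
function (doc) {
  var KNOWN_MAILBOXES = ["INBOX", "mbox1", "mbox2"];
  var n = KNOWN_MAILBOXES.length;
  for (var mask = 1; mask < (1 << n); mask++) {
    var subset = [];
    for (var i = 0; i < n; i++) {
      if (mask & (1 << i)) subset.push(KNOWN_MAILBOXES[i]);
    }
    if (subset.indexOf(doc.mailbox) !== -1) {
      emit([subset, doc.sender], null);
    }
  }
}

With a _count reduce and a second pass as above, every one of the 2^n - 1 possible selections gets its own sorted-by-count index, which is exactly where the exponential disk usage comes from.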

Lastly, I could dynamically generate views which handle the desired filters on the fly, only on an as-needed basis, and tear them down after they are no longer being used (to save on disk space). The downsides here are a giant jump in code complexity, and a big up-front cost every time a new filter is selected.
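
A rough sketch of that third option: generate a throwaway design doc for the current filter, build it, query it, and delete it later. The naming scheme and the hard-coded mailbox test are just illustrative:

{
  "_id": "_design/tmp_filter_mbox1_mbox2",
  "views": {
    "by_sender": {
      "map": "function (doc) { if (doc.mailbox === 'mbox1' || doc.mailbox === 'mbox2') { emit(doc.sender, null); } }",
      "reduce": "_count"
    }
  }
}

The up-front cost mentioned above is that this view (and any second-pass view chained off it) has to index the whole database before the first query against it can return.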

Is there a better way?

Upvotes: 1

Views: 2593

Answers (1)

Mike Miller

Reputation: 136

I've been thinking about this for nearly a day and I think that there is no better way to do this than what you have proposed. The challenges that you face are the following:

1) The aggregation work (count, sum, etc.) can only be done in the CouchDB/Cloudant API via the materialized view engine (map/reduce).

2) While the group_level API provides some flexibility to specify variable granularity at query time, it isn't sufficiently flexible for arbitrary boolean queries (see the example after this list).

3) Arbitrary boolean queries are possible in the Cloudant API via the Lucene-based _search API. However, the _search API doesn't support aggregation after the query. Limited support for what you want is only possible in Lucene via faceting, which isn't yet supported in Cloudant. Even then, I believe it may only support count and not sum or more complex aggregations.
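
To illustrate point 2: with a view keyed on [mailbox, sender], group_level lets you pick the granularity of the reduce at query time, but only as a prefix of the key, never as an arbitrary predicate (design doc and view names are hypothetical):

GET /db/_design/mail/_view/by_mbox_sender?group_level=1   (one count per mailbox)
GET /db/_design/mail/_view/by_mbox_sender?group_level=2   (one count per [mailbox, sender] pair)

There is no equivalent of mailbox in ('mbox1', 'mbox2') here beyond restricting to a contiguous key range with startkey/endkey.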

I think your best option is to use the _search API, make use of sort, group_by, or group_sort, and then do the aggregation on the client. A few sample URLs to test would look like:

GET /db/_design/ddoc/_search/indexname?q=name:mike AND age:[1.2 TO 4.5]&sort=["age","name"]

GET /db/_design/ddoc/_search/indexname?q=name:mike&group_by=mailbox&group_sort=["age","name"]
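
From there, the aggregation itself has to happen on the client. A minimal sketch, assuming the search index stores the sender field and the response rows expose it under row.fields (both assumptions about how the index is defined):

// Count hits per sender from a Cloudant _search response and sort the
// senders by descending count, mirroring ORDER BY count(*) in the SQL.
function countBySender(searchResponse) {
  var counts = {};
  searchResponse.rows.forEach(function (row) {
    var sender = row.fields.sender;   // assumes this field is stored in the index
    counts[sender] = (counts[sender] || 0) + 1;
  });
  return Object.keys(counts)
    .map(function (s) { return { sender: s, count: counts[s] }; })
    .sort(function (a, b) { return b.count - a.count; });
}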

Upvotes: 0
