nicholas a. evans

Reputation: 2233

filter and sort based on an aggregate with Cloudant/CouchDB chained map reduce

I would like to filter a list and sort it based on an aggregate; something that is fairly simple to express in SQL, but I'm puzzled about the best way to do it with iterative Map Reduce. I'm specifically using Cloudant's "dbcopy" addition to CouchDB, but I think the approach might be similar with other map/reduce architectures.

Pseudocode SQL might look like so:

SELECT   grouping_field, aggregate(*)
FROM     data
WHERE    #{filter}
GROUP BY grouping_field
ORDER BY aggregate(*), grouping_field
LIMIT    page_size

The filter might be looking for an exact match or it might be searching within a range; e.g. field in ('foo', 'bar') or field between 37 and 42.

As a concrete example, consider a dataset of emails; the grouping field might be "List-id", "Sender", or "Subject"; the aggregate function might be count(*), or max(date) or min(date); and the filter clause might consider flags, a date range, or a mailbox ID. The documents might look like so:

{
  "id": "foobar", "mailbox": "INBOX", "date": "2013-03-29",
  "sender": "[email protected]", "subject": "Foo Bar"
}

Getting a count of emails with the same sender is trivial:

"map": "function (doc) { emit(doc.sender, null) }",
"reduce": "_count"

And Cloudant has a good example of sorting by count on the second pass of a map reduce. But when I also want to filter (e.g. by mailbox), things get messy fast.

If I add the filter to the view keys (e.g. the final result looks like {"key": ["INBOX", 1234, "[email protected]"], "value": null}), then it's trivial to sort by count within a single filter value. But sorting that data by count across multiple filter values would require traversing the entire data set (once per key), which is far too slow on large data sets.
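
For concreteness, here is roughly what that first option looks like with Cloudant's dbcopy chaining. All database, design doc, and view names are hypothetical, and the second pass assumes dbcopy writes each reduced row into the target database as a document with "key" and "value" fields, as in Cloudant's example.

First pass, on the mail database (dbcopy materializes the reduced rows into "mail_counts"):

{
  "views": {
    "by_mbox_sender": {
      "map": "function (doc) { emit([doc.mailbox, doc.sender], null); }",
      "reduce": "_count",
      "dbcopy": "mail_counts"
    }
  }
}

Second pass, on "mail_counts", re-keyed so the count comes before the sender:

{
  "views": {
    "by_mbox_count": {
      "map": "function (doc) { emit([doc.key[0], doc.value, doc.key[1]], null); }"
    }
  }
}

Querying the second view with startkey=["INBOX"]&endkey=["INBOX", {}] then returns INBOX senders ordered by count (ascending; swap the keys and add descending=true for largest-first).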

Or I could create an index for each potential filter selection; e.g. the final result looks like {"key": [["mbox1", "mbox2"], 1234, "[email protected]"], "value": null} (for when both "mbox1" and "mbox2" are selected) or {"key": [["mbox1"], 1234, "[email protected]"], "value": null} (for when only "mbox1" is selected). That's easy to query, and fast. But the disk size of the index grows exponentially with the number of distinct filter values, since every possible combination of selections needs its own pre-reduced entries. And it seems completely untenable for filtering on open-ended data, such as date ranges.
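
To make that blow-up concrete, the emit-per-selection idea needs something like the following first-pass map, where the full mailbox list is baked into the view. Everything here is hypothetical, and it is the approach being questioned, not a recommendation:

// Emits one pre-aggregated row for every subset of mailboxes that contains
// this document's mailbox, so each possible filter selection gets its own
// pre-reduced index. KNOWN_MAILBOXES must list every mailbox up front.
function (doc) {
  var KNOWN_MAILBOXES = ["INBOX", "mbox1", "mbox2"];
  var n = KNOWN_MAILBOXES.length;
  for (var mask = 1; mask < (1 << n); mask++) {
    var subset = [];
    for (var i = 0; i < n; i++) {
      if (mask & (1 << i)) subset.push(KNOWN_MAILBOXES[i]);
    }
    if (subset.indexOf(doc.mailbox) !== -1) {
      emit([subset, doc.sender], null);
    }
  }
}

With a _count reduce and a second pass as above, every one of the 2^n - 1 possible selections gets its own sorted-by-count index, which is exactly where the exponential disk usage comes from.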

Lastly, I could dynamically generate views which handle the desired filters on the fly, only on an as-needed basis, and tear them down after they are no longer being used (to save on disk space). The downsides here are a giant jump in code complexity, and a big up-front cost every time a new filter is selected.
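
A rough sketch of that third option: generate a throwaway design doc for the current filter, build it, query it, and delete it later. The naming scheme and the hard-coded mailbox test are just illustrative:

{
  "_id": "_design/tmp_filter_mbox1_mbox2",
  "views": {
    "by_sender": {
      "map": "function (doc) { if (doc.mailbox === 'mbox1' || doc.mailbox === 'mbox2') { emit(doc.sender, null); } }",
      "reduce": "_count"
    }
  }
}

The up-front cost mentioned above is that this view (and any second-pass view chained off it) has to index the whole database before the first query against it can return.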

Is there a better way?

Upvotes: 1

Views: 2593

Answers (1)

Mike Miller

Reputation: 136

I've been thinking about this for nearly a day and I think that there is no better way to do this than what you have proposed. The challenges that you face are the following:

1) The aggregation work (count, sum, etc.) can only be done in the CouchDB/Cloudant API via the materialized view engine (map/reduce).

2) While the group_level API provides some flexibility to specify variable granularity at query time, it isn't sufficiently flexible for arbitrary boolean queries (see the example after this list).

3) Arbitrary boolean queries are possible in the Cloudant API via the Lucene-based _search API. However, the _search API doesn't support aggregation after the query. Limited support for what you want is only possible in Lucene via faceting, which isn't yet supported in Cloudant. Even then, I believe it may only support count and not sum or more complex aggregations.
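
To illustrate point 2: with a view keyed on [mailbox, sender], group_level lets you pick the granularity of the reduce at query time, but only as a prefix of the key, never as an arbitrary predicate (design doc and view names are hypothetical):

GET /db/_design/mail/_view/by_mbox_sender?group_level=1   (one count per mailbox)
GET /db/_design/mail/_view/by_mbox_sender?group_level=2   (one count per [mailbox, sender] pair)

There is no equivalent of mailbox in ('mbox1', 'mbox2') here beyond restricting to a contiguous key range with startkey/endkey.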

I think your best option is to use the _search API, make use of sort, group_by, or group_sort, and then do the aggregation on the client. A few sample URLs to test would look like:

GET /db/_design/ddoc/_search/indexname?q=name:mike AND age:[1.2 TO 4.5]&sort=["age","name"]

GET /db/_design/ddoc/_search/indexname?q=name:mike&group_by=mailbox&group_sort=["age","name"]
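
From there, the aggregation itself has to happen on the client. A minimal sketch, assuming the search index stores the sender field and the response rows expose it under row.fields (both assumptions about how the index is defined):

// Count hits per sender from a Cloudant _search response and sort the
// senders by descending count, mirroring ORDER BY count(*) in the SQL.
function countBySender(searchResponse) {
  var counts = {};
  searchResponse.rows.forEach(function (row) {
    var sender = row.fields.sender;   // assumes this field is stored in the index
    counts[sender] = (counts[sender] || 0) + 1;
  });
  return Object.keys(counts)
    .map(function (s) { return { sender: s, count: counts[s] }; })
    .sort(function (a, b) { return b.count - a.count; });
}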

Upvotes: 0
