Reputation: 9953
I've just started messing around with CouchDB. In many places, I've seen it stated that the reduce function for a CouchDB view needs to be both associative and commutative (for example, the CouchDB wiki says this). I do understand that CouchDB can do the reduce in stages and cache and re-use intermediate results via re-reduce. It is also clear to me why reduce functions must be associative. However, I don't understand where the commutative requirement comes from.
Let me try to be a bit more clear. Let's say I have a map function that outputs 10 values when run over all the contents of my database. For the sake of simplicity, let's say those values are the integers 1 - 10 (1, 2, 3, 4, etc.).
Let's say I have a function, `f(keys, values, rereduce)`, that is associative, but not commutative. Let's say that `f` ignores the `keys` and `rereduce` arguments and operates only on the `values` array. I will therefore omit the `keys` and `rereduce` arguments in the rest of my explanation.
As I understand it, the commutative property would require that `f([1, 2]) == f([2, 1])`, but I don't think CouchDB would ever do that. I do understand that it may pass me the results of previous computations of `f` on a subset of the map outputs, but I believe it will always pass them to me in order. For example, I might get called, with `rereduce == true`, like `f([f([1, 2, 3]), f([4, 5, 6])])`, but I don't think my reduce function would ever get called like `f([f([4, 5, 6]), f([1, 2, 3])])` or in any other way in which the array arguments aren't "ordered". If that's correct, then I don't understand why my reduce function needs to be commutative.
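To make the setup concrete, here is a minimal sketch of such a function: string concatenation, which is associative but not commutative. The function name and the standalone calls are my own illustration, not part of the CouchDB API.

```javascript
// Sketch of a CouchDB-style reduce that is associative but NOT
// commutative: string concatenation. Assumes values arrive in order.
function f(keys, values, rereduce) {
  // Same logic for both passes: joining in-order partial results
  // yields the same string as joining the original values in order.
  return values.join("");
}

// Associative: regrouping (as in a rereduce) preserves the result...
f(null, [f(null, ["a", "b"], false), f(null, ["c", "d"], false)], true); // "abcd"
f(null, ["a", "b", "c", "d"], false);                                    // "abcd"

// ...but not commutative: reordering the inputs changes the result.
f(null, ["b", "a"], false); // "ba", not "ab"
```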
Extra Explanation:
The entire question is above, but in case it helps somebody understand where I'm confused, here's my understanding of what the associative property requires.
To me, it would seem that `f` being associative means that:
`f([1, 2, 3]) == f([f([1, 2]), 3]) == f([1, f([2, 3])]) == f([f([1]), f([2, 3])]) == f([f([1, 2]), f([3])])`
The first 3 are the standard mathematical associative definition for a binary operator. The last 2 are a slight generalization due to the fact that we're working with arrays and non-binary operators. Thus, for example, if CouchDB has cached the result of reducing `[1, 2]` and the result of reducing `[3, 4]`, it can call my reduce function, with `rereduce == true`, passing just `f([1, 2])` and `f([3, 4])` to compute `f([1, 2, 3, 4])`, without requiring my code to process all 4 items.
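These equalities are easy to check with a toy reduce. Here is a sketch using summation (my own example; summation happens to also be commutative, but only its associativity is exercised here), ignoring `keys` and `rereduce` for brevity as above:

```javascript
// Toy reduce (summation) to check the generalized associativity above.
function f(values) {
  return values.reduce(function (acc, v) { return acc + v; }, 0);
}

// All five groupings from the equality chain agree:
f([1, 2, 3]);               // 6
f([f([1, 2]), 3]);          // f([3, 3]) == 6
f([1, f([2, 3])]);          // f([1, 5]) == 6
f([f([1]), f([2, 3])]);     // 6
f([f([1, 2]), f([3])]);     // 6

// So cached partial reductions can be combined:
f([f([1, 2]), f([3, 4])]);  // f([3, 7]) == 10 == f([1, 2, 3, 4])
```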
Application
For what it's worth, I have a use case where my map outputs are permutations. Since the permutation function is associative, but not commutative, I think I can use CouchDB reduce caching to do some cool stuff really efficiently. And, in my testing, it seems to work. But the wiki, the CouchDB book, and other sources say the reduce has to be commutative and I therefore shouldn't do this.
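For reference, here is a sketch of what I mean: permutations represented as arrays mapping position to image, with composition as the reduce. The representation and function names are my own illustration.

```javascript
// Permutations represented as arrays: p[i] is the image of position i.
// Composition is associative but not commutative.
function compose(p, q) {
  // (p composed with q)[i] = p[q[i]]: apply q first, then p.
  return q.map(function (i) { return p[i]; });
}

// CouchDB-style reduce: fold the permutations left to right.
// The same fold works for rereduce, because composing in-order
// partial compositions equals composing the whole ordered sequence.
function reducePerms(keys, values, rereduce) {
  return values.reduce(compose);
}

var p = [1, 2, 0]; // cyclic shift
var q = [1, 0, 2]; // swap positions 0 and 1
reducePerms(null, [p, q], false); // [2, 1, 0]
reducePerms(null, [q, p], false); // [0, 2, 1], so order matters
```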
Upvotes: 2
Views: 478
Reputation: 9953
Posted this to the CouchDB mailing list and got an answer. It seems that, currently, reduce functions do not need to be commutative. However, CouchDB plans to integrate BigCouch, at which point data will be sharded across multiple servers. In that environment, the values passed to the reduce function can arrive in an essentially arbitrary order, so commutativity is required.
While non-commutative functions would continue to work fine if there's only a single shard, the CouchDB team is not promising this will always work; they reserve the right to change things in the future and can only promise that your reduce will work if it is commutative.
Upvotes: 0