John Earnshaw
John Earnshaw

Reputation: 331

Counting couchdb rows unique only

I have a db in couch with 55,000,000 docs. Many of the docs have duplicate values for certain properties and I would like to get a count of only unique values for a property.

I'm new to couchdb and saw list function but this is far too slow for iterating over 55 million rows and times out.

If I do:

"map": "function(doc) { if (doc.property) { emit(doc.property, 1); } }" "reduce": "_count"

and then group, I get the total count of property including duplicates. How can I get this reduced to uniques only?

Thanks.

Upvotes: 2

Views: 633

Answers (4)

Dan Zuzevich
Dan Zuzevich

Reputation: 3831

Using newer JavaScript features, you can make use of a Set, which only allows a value to appear once. This example uses getting all of the unique fruits listed in a database. There is also no need to emit a value in the map function.

Example Document Layout:

{
    "type": "fruits",
    "item": "orange",
    ...whatever else
}

Map:

function (doc) {
  if(doc.type === 'fruits') {
    emit(doc.item, null)
  }
}

Reduce:

function (keys, values) {
  const fruits = new Set()
  
  for(const key of keys) 
    if(!fruits.has(key)) fruits.add(key)
    
  return fruits
}

Upvotes: 0

Jonathan Wylie
Jonathan Wylie

Reputation: 385

I hope no one is using the accepted answer from Mariusz here because it doesn't work, at least in couchDB

CouchDB reduce functions also need to perform rereduces. That is reducing the output of several other reduces.

Typical solution Make your map function output a unique key, and then just reduce with _count. Exactly what you suggested in your question, except group=true. This will count how many instances of each unique thing you have. Each row will represent a unique thing. You can easily count the total rows in a list function.

Alternatively You may not wish to make the key unique eg you might have time series data, and wish to query the unique values within a certain time range, then you have to include the datetime in the key. To handle this case is tricky.

Option 1: The naive solution is to not count the unique values but just make a big list of unique values a bit like this, and then count them all in the client, or in a list function afterwards.

function (keys, values, rereduce) {

    var unique = {};

    var getUniqueValues = function(values) {
        for (i = 0; i < values.length; i++) {
            if (values[i] in unique) {
            } else {
                unique[values[i]] = null;
            }
        }
    }

    if (rereduce === true) {
        for (j = 0; j < values.length; j++) {
            getUniqueValues(values[j]);
        };
        return Object.keys(unique);
    } else {
        getUniqueValues(values);
        return Object.keys(unique);
    }

}

Option 2: Another option is not reducing at all, and just counting the unique values in a list function. As you say, this can get slow when there are lots of values.

Option 3: To avoid using excessive memory when counting a large number of unique things is tricky. It can be done by hashing the unique value to a bit on a bit map. Then counting how many 1's there are in the final bitmap.

This also lets you use a reduce function, because you can combine bitmaps to combine your unique results. Then finally in the client or in list function count the 1's in the bitmap.

I haven't tried this in couchdb yet, but the theory is sound: http://highscalability.com/blog/2012/4/5/big-data-counting-how-to-count-a-billion-distinct-objects-us.html

The one caveat is that there may be a small error if the bitmap is not large enough. However when you are counting very large amounts, a small error is often acceptable.

Upvotes: 1

John Earnshaw
John Earnshaw

Reputation: 331

function(keys, values) {
  var result = [];
  keys.forEach(function(key) {
      if (result.indexOf(key[0]) == -1) {
          result.push(key[0]);
      }
  });

  return result.length;
}

Upvotes: 0

Mariusz
Mariusz

Reputation: 481

Your map function is OK - you can't do better here. Let's focus on reduce.

function(keys, values) {
  var result = {};
  var counter = 0;
  keys.forEach(function(key) { 
    if (!result[key]) {
      result[key] = true; // or whatever
      counter++;
    }
  });

  return counter;
}

Upvotes: 1

Related Questions