Deftness

Reputation: 305

Cloudant list function or separate reduce

I know Cloudant is based on CouchDB. One of my views returns a long list of rows, for example:

{"rows":[
  {"key":[2015,10,7,"one"],"value":2},
  {"key":[2015,10,7,"two"],"value":1},
  {"key":[2015,10,7,"three"],"value":2}
  ....
]}

This approach worked and was originally proposed here. However, my data set is now growing significantly, and the number of rows can reach 20k.

From the returned object I can of course get the count of rows. Rather than returning all of that data, I was hoping to run the output of this view through a list function, as described for CouchDB here.

So I guess a few questions:

  1. Has anyone used the _list functionality in Cloudant?
  2. Alternatively, does anyone know a reduce and re-reduce function that would just give me the number of rows (i.e. the number of keys)? Otherwise it takes far too long to return all of the data just to get a simple count of rows.

Thanks!

Upvotes: 0

Views: 459

Answers (2)

Nuno Cruces

Reputation: 1743

I'm not sure I understand your question. But if you just want to obtain the total number of rows in your view, without returning any data at all, you can query your view with limit=0 as an argument.

E.g.:

http://examples.cloudant.com/simplegeo_places/_all_docs?limit=0

This shows that the simplegeo_places test database has 21.7 million documents:

{"total_rows":21735117,"offset":0,"rows":[

]}

Note that total_rows is the total number of rows in your view, not the number of rows that would be returned, had you not specified limit=0.
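
A minimal sketch of the limit=0 approach from the client side, in JavaScript. The helper names countUrl and totalRows are hypothetical and not part of any Cloudant client library; the sample body is the response shown above. As far as I know, a view queried with its reduce step applied does not report total_rows, so the sketch assumes map-only output (for instance by also passing reduce=false).

```javascript
// Build a URL that asks for zero rows, so only the metadata comes back.
// `baseUrl` is any view or _all_docs URL (hypothetical helper).
function countUrl(baseUrl) {
  var sep = baseUrl.includes("?") ? "&" : "?";
  return baseUrl + sep + "limit=0";
}

// Pull total_rows out of the raw JSON response body.
function totalRows(body) {
  var parsed = JSON.parse(body);
  if (typeof parsed.total_rows !== "number") {
    // Assumption: reduced view output carries no total_rows field.
    throw new Error("response has no total_rows");
  }
  return parsed.total_rows;
}

// The response body quoted in this answer.
var sample = '{"total_rows":21735117,"offset":0,"rows":[\n\n]}';

console.log(countUrl("http://examples.cloudant.com/simplegeo_places/_all_docs"));
console.log(totalRows(sample));
```

The actual HTTP request is left out on purpose; any client that can GET the URL and hand the body to totalRows would do.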


PS: Yes, Cloudant does support list functions, and you could use the head argument passed to the list function to access total_rows.

Upvotes: 2

gadamcox

Reputation: 191

This is an instance of the count-distinct problem. The naive solution for this doesn't scale. But as long as your compute resources are greater than your data size, you can eventually make an exact calculation.

The _list function will probably not give you any gains, but I suppose you could just try it. The _list function must still wait for all of the results from the view to be collected before your function executes to start counting uniques.
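
If you do want to try it, here is a rough sketch of a list function that only counts rows instead of returning them. The function name countRows is hypothetical; getRow() and send() are the globals CouchDB provides to list functions, and they are stubbed here only so the sketch runs under plain Node. Note that it still walks every row on the server, so it saves transfer size rather than compute.

```javascript
// List function as it might appear under "lists" in a design document.
function countRows(head, req) {
  var count = 0;
  while (getRow()) {      // getRow() is falsy once the rows are exhausted
    count++;
  }
  // head.total_rows may be absent when the view is queried with a reduce.
  send(JSON.stringify({ count: count, total_rows: head.total_rows }));
}

// Stand-ins for the CouchDB sandbox (assumption: local testing only).
var fakeRows = [
  { key: [2015, 10, 7, "one"], value: 2 },
  { key: [2015, 10, 7, "two"], value: 1 },
  { key: [2015, 10, 7, "three"], value: 2 }
];
var sent = [];
function getRow() {
  return fakeRows.length ? fakeRows.shift() : null;
}
function send(chunk) {
  sent.push(chunk);
}

countRows({ total_rows: 3, offset: 0 }, {});
console.log(sent.join(""));
```

In a real design document only the body of countRows would be stored; the stubs are not needed there.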

Alternatively, while your data size is still relatively small, and if it is likely to stay small, you could consider warehousing your Cloudant data to dashDB and using an SQL select statement (although this will still take significant time to compute).

After that, the options could be to use Bluemix Spark Service to run the second reduce, or even better, use a HyperLogLog library/algorithm to make an accurate and timely estimate if your distinct count starts to get really large.
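
To give an idea of what the HyperLogLog route looks like, here is a small self-contained sketch in JavaScript. It is not the API of any particular library: HyperLogLog and hash32 are hypothetical names, the hash is a simple FNV-1a with a mixing step chosen for brevity, and the precision p = 12 (4096 registers) is an assumption you would tune. The result is an estimate, not an exact count.

```javascript
// 32-bit string hash: FNV-1a followed by a bit-mixing finalizer.
function hash32(str) {
  var h = 0x811c9dc5;
  for (var i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  h ^= h >>> 16;
  h = Math.imul(h, 0x85ebca6b);
  h ^= h >>> 13;
  h = Math.imul(h, 0xc2b2ae35);
  h ^= h >>> 16;
  return h >>> 0;
}

// p is the number of index bits; this sketch assumes 7 <= p <= 16.
function HyperLogLog(p) {
  this.p = p;
  this.m = 1 << p;
  this.registers = new Uint8Array(this.m);
}

HyperLogLog.prototype.add = function (item) {
  var h = hash32(String(item));
  var idx = h >>> (32 - this.p);           // top p bits pick a register
  var rest = (h << this.p) >>> 0;          // remaining bits, left-aligned
  // Rank = position of the first 1 bit in the remaining bits.
  var rank = rest === 0 ? (32 - this.p + 1) : Math.clz32(rest) + 1;
  if (rank > this.registers[idx]) this.registers[idx] = rank;
};

HyperLogLog.prototype.count = function () {
  var m = this.m, sum = 0, zeros = 0;
  for (var i = 0; i < m; i++) {
    sum += Math.pow(2, -this.registers[i]);
    if (this.registers[i] === 0) zeros++;
  }
  var alpha = 0.7213 / (1 + 1.079 / m);    // bias constant for m >= 128
  var est = alpha * m * m / sum;
  // Small-range correction: fall back to linear counting.
  if (est <= 2.5 * m && zeros > 0) est = m * Math.log(m / zeros);
  return Math.round(est);
};

// Feed view keys in as strings; duplicates do not change the estimate.
var hll = new HyperLogLog(12);
[[2015, 10, 7, "one"], [2015, 10, 7, "two"], [2015, 10, 7, "one"]].forEach(function (key) {
  hll.add(JSON.stringify(key));
});
console.log(hll.count());
```

The appeal is that the registers take a fixed, small amount of memory regardless of how many keys are added, and two sketches built over different partitions can be merged by taking the register-wise maximum.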

Upvotes: 0
