Reputation: 305
So I know Cloudant is based on CouchDB. One of my views returns a list with a lot of rows, e.g.:
{"rows":[
{"key":[2015,10,7,"one"],"value":2},
{"key":[2015,10,7,"two"],"value":1},
{"key":[2015,10,7,"three"],"value":2}
....
]}
The above solution worked and was originally proposed here. However, my data set is now growing significantly, and the number of rows can reach 20k.
The returned object does, of course, include a count of the number of rows. Rather than returning all of that data, I was hoping to run the output of this view through a list function, as described in the CouchDB documentation here.
So I guess a few questions:
Thanks!
Upvotes: 0
Views: 459
Reputation: 1743
I'm not sure I understand your question, but if you just want to obtain the total number of rows in your view, without returning any data at all, you can query your view with limit=0 as an argument.
E.g.:
http://examples.cloudant.com/simplegeo_places/_all_docs?limit=0
This lets you find out that the simplegeo_places test database has 21.7 million documents:
{"total_rows":21735117,"offset":0,"rows":[
]}
Note that total_rows is the total number of rows in your view, not the number of rows that would have been returned had you not specified limit=0.
PS: Yes, Cloudant does support list functions, and you could use the head parameter to access total_rows.
Upvotes: 2
Reputation: 191
This is an instance of the count-distinct problem. The naive solution for this doesn't scale. But as long as your compute resources are greater than your data size, you can eventually make an exact calculation.
The _list function will probably not give you any gains, though you could try it: a _list function must still wait for all of the view's results to be collected before your function can execute and start counting uniques.
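For illustration, a unique-counting _list function might look like the sketch below (all names are my assumptions; getRow and send are provided by the CouchDB query server at runtime). Note that it still pulls every view row through getRow(), which is exactly why it offers no gains over counting client-side.

```javascript
// Hedged sketch of a _list function that counts distinct keys. getRow() and
// send() are supplied by the CouchDB query server; everything else here is
// illustrative. Memory grows with the number of DISTINCT keys, not total rows.
function countDistinct(head, req) {
  var seen = {};
  var distinct = 0;
  var row;
  while ((row = getRow()) !== null) {
    var k = JSON.stringify(row.key);   // view keys are arrays, so serialize
    if (!seen[k]) {
      seen[k] = true;
      distinct++;
    }
  }
  send(JSON.stringify({ distinct: distinct }));
}
```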
Alternatively, while your data size is still relatively small (and if it will tend to stay small), you could consider warehousing your Cloudant data in dashDB and using a SQL select statement (although computing this will still take significant time).
After that, your options could be to use the Bluemix Spark service to run a second reduce or, even better, to use a HyperLogLog library/algorithm to make an accurate and timely estimate once your distinct count gets really large.
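To make the HyperLogLog idea concrete, here is a toy sketch in plain JavaScript (all names are mine; use a proper library in production). Instead of remembering every key, it keeps 2^p one-byte registers, trading a few percent of accuracy for constant memory:

```javascript
// Toy HyperLogLog sketch, purely to illustrate the approach named above;
// this is NOT production-quality code. m = 2^p registers give a typical
// relative error of about 1.04 / sqrt(m).

function fnv1a(str) {
  // 32-bit FNV-1a hash with a murmur-style finalizer to mix the bits.
  var h = 0x811c9dc5;
  for (var i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  h ^= h >>> 16;
  h = Math.imul(h, 0x85ebca6b);
  h ^= h >>> 13;
  h = Math.imul(h, 0xc2b2ae35);
  h ^= h >>> 16;
  return h >>> 0;
}

function HLL(p) {
  this.p = p;                  // p index bits => m = 2^p registers
  this.m = 1 << p;
  this.reg = new Uint8Array(this.m);
}

HLL.prototype.add = function (value) {
  var h = fnv1a(String(value));
  var idx = h >>> (32 - this.p);          // top p bits pick a register
  var rest = (h << this.p) >>> 0;         // remaining 32 - p bits
  var rank = 1;                           // 1-based position of first 1-bit
  while (rank <= 32 - this.p && (rest & 0x80000000) === 0) {
    rest = (rest << 1) >>> 0;
    rank++;
  }
  if (rank > this.reg[idx]) this.reg[idx] = rank;
};

HLL.prototype.count = function () {
  var m = this.m, sum = 0, zeros = 0;
  for (var j = 0; j < m; j++) {
    sum += Math.pow(2, -this.reg[j]);
    if (this.reg[j] === 0) zeros++;
  }
  var alpha = 0.7213 / (1 + 1.079 / m);   // bias constant, valid for m >= 128
  var est = alpha * m * m / sum;
  if (est <= 2.5 * m && zeros > 0) {
    est = m * Math.log(m / zeros);        // small-range (linear counting) fix
  }
  return Math.round(est);
};
```

Feeding each view key through add() and calling count() at the end gives an estimate whose error stays at a few percent no matter how far past 20k the row count grows.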
Upvotes: 0