Chris Dutrow

Reputation: 50362

Estimating number of results in Google App Engine Query

I'm attempting to estimate the total number of results for App Engine queries that will return large numbers of results.

In order to do this, I assigned a random floating-point number between 0 and 1 to every entity. Then I executed the query whose total results I wanted to estimate with the following three settings:

 * I ordered by the random numbers that I had assigned in ascending order
 * I set the offset to 1000
 * I fetched only one entity

I then plugged the entity's random value that I had assigned for this purpose into the following equation to estimate the total results (since I used 1000 as the offset above, the value of OFFSET would be 1000 in this case):

(1 / RANDOM) * OFFSET

The idea is that since each entity has a random number assigned to it, and I am sorting by that random number, the random value of the entity at a given offset should be roughly proportional to that offset's position within the full result set; that is, the entity at offset 1000 should have a random value of approximately 1000 / TOTAL, which is what makes the formula above work.
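In code, the approach looks roughly like this (a minimal sketch against the old google.appengine.ext.db API; the Item model and its rand property are placeholder names):

    # Minimal sketch of the sampling trick described above, using the
    # old db API. Item and rand are placeholder names.
    import random
    from google.appengine.ext import db

    OFFSET = 1000

    class Item(db.Model):
        # Assigned once at creation time from random.random().
        rand = db.FloatProperty()

    def estimate_total():
        q = Item.all().order('rand')        # ascending by the random value
        sample = q.fetch(1, offset=OFFSET)  # only the entity at OFFSET
        if not sample:
            return None                     # fewer than OFFSET entities exist
        # If rand is uniform on [0, 1), the entity at position OFFSET
        # should have rand ~= OFFSET / TOTAL, so TOTAL ~= OFFSET / rand.
        return OFFSET / sample[0].rand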

The problem I am having is that the estimates I am getting are too low, and they get lower as the offset decreases. I had anticipated that the lower the offset I used, the less accurate the estimate would be, but I thought the margin of error would fall both above and below the actual number of results.

Below is a chart demonstrating what I am talking about. As you can see, the predictions get more consistent (accurate) as the offset increases from 1000 to 5000. But the predictions then settle onto a fourth-degree polynomial (y = -5E-15*x^4 + 7E-10*x^3 - 3E-05*x^2 + 0.3781*x + 51608).

Am I making a mistake here, or does the standard Python random number generator not distribute numbers evenly enough for this purpose?

Thanks!

[chart: estimated total results plotted against offset]

Edit:

It turns out that this problem was due to a mistake of mine. In another part of the program, I was grabbing entities from the beginning of the series, performing an operation on them, and then re-assigning their random numbers. This resulted in a denser distribution of random numbers towards the end of the range.
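Here is a toy simulation of that churn (my reconstruction of the bug, with made-up sizes) showing how repeatedly re-assigning the smallest random values skews the distribution upward, which in turn drives OFFSET / RANDOM below the true total:

    # Toy simulation: repeatedly take the entity with the smallest
    # random value, "process" it, and give it a fresh random value.
    # Low values get recycled quickly while high values persist, so
    # density piles up near 1.0 and the estimate comes out low.
    import heapq
    import random

    N = 10000
    values = [random.random() for _ in range(N)]
    heapq.heapify(values)

    for _ in range(5 * N):
        # Pop the current minimum and replace it with a fresh draw.
        heapq.heapreplace(values, random.random())

    values.sort()
    k = 1000
    print('mean: %.3f (uniform would be ~0.5)' % (sum(values) / N))
    print('value at rank %d: %.4f (uniform would be ~%.4f)'
          % (k, values[k], float(k) / N))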

I did a little more digging into this concept, fixed the problem, and tried it again on a different query (so the number of results is different from above). I found that this idea can be used to estimate the total results for a query. One thing of note is that the "error" is very similar for offsets that are close together. When I made a scatter chart in Excel, I expected the accuracy of the predictions at each offset to form a "cloud": offsets at the very beginning would produce a larger, less dense cloud that would converge to a very tiny, dense cloud around the actual value as the offsets got larger. This is not what happened, as you can see below in the chart of how far off the predictions were at each offset. Where I thought there would be a cloud of dots, there is a line instead.

[chart: prediction error at each offset]

This is a chart of the maximum error after each offset. For example, the maximum error for any offset after 10000 was less than 1%:

[chart: maximum error beyond each offset]

Upvotes: 4

Views: 314

Answers (3)

lucemia

Reputation: 6617

Some quick thoughts:

  1. Have you tried the Datastore Statistics API? It may provide fast and accurate results if you don't update your entity set very frequently. http://code.google.com/appengine/docs/python/datastore/stats.html

    [EDIT1.]

  2. I did some math, and I think the estimation method you proposed here can be rephrased as an "order statistic" problem. http://en.wikipedia.org/wiki/Order_statistic#The_order_statistics_of_the_uniform_distribution

For example:

If the actual number of entities is 60000, the question is equivalent to: "What is the probability that your 1000th [2000th, 3000th, ...] sample falls in an interval [l, u] such that the estimated total number of entities based on that sample is within an acceptable error of 60000?"

If the acceptable error is 5%, the interval [l, u] will be [0.015873015873015872, 0.017543859649122806]. I think the probability won't be very large.
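One way to check that probability: the k-th order statistic of n i.i.d. Uniform(0, 1) samples follows a Beta(k, n - k + 1) distribution, so (assuming scipy is available) the calculation is short:

    # Probability that the estimate based on the k-th sample lands
    # within 5% of the true total n, using the Beta distribution of
    # the k-th order statistic of n uniform samples.
    from scipy.stats import beta

    n = 60000    # actual number of entities
    k = 1000     # offset of the sampled entity
    err = 0.05   # acceptable relative error

    # The estimate k / x lies within [n*(1-err), n*(1+err)] exactly
    # when x lies within [k / (n*(1+err)), k / (n*(1-err))].
    l = k / (n * (1 + err))   # = 0.015873...
    u = k / (n * (1 - err))   # = 0.017543...

    p = beta.cdf(u, k, n - k + 1) - beta.cdf(l, k, n - k + 1)
    print('P(within %d%%) = %.3f' % (err * 100, p))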

Upvotes: 1

Sudhir Jonathan

Reputation: 17516

When using GAE it makes a lot more sense not to try to do large amounts of work on reads - it's built and optimized for very fast request turnarounds. In this case it's actually more efficient to maintain a count of your results as and when you create the entities.

If you have a standard query, this is fairly easy - just use a sharded counter when creating the entities, as in the sketch below. You can seed it using a MapReduce job to get the initial count.
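A minimal sketch of such a counter (illustrative names, following the classic App Engine sharding recipe with the old db API):

    # Sharded counter sketch: writes spread across NUM_SHARDS entities
    # so concurrent increments rarely contend on one entity group.
    import random
    from google.appengine.ext import db

    NUM_SHARDS = 20

    class CounterShard(db.Model):
        name = db.StringProperty(required=True)
        count = db.IntegerProperty(default=0)

    def increment(name):
        # Pick a shard at random and bump it inside a transaction.
        index = random.randint(0, NUM_SHARDS - 1)
        def txn():
            key_name = '%s_%d' % (name, index)
            shard = CounterShard.get_by_key_name(key_name)
            if shard is None:
                shard = CounterShard(key_name=key_name, name=name)
            shard.count += 1
            shard.put()
        db.run_in_transaction(txn)

    def get_count(name):
        # Reading is a single small query that sums all shards.
        return sum(s.count for s in
                   CounterShard.all().filter('name =', name))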

If you have queries that might be dynamic, this is more difficult. If you know the range of possible queries that you might perform, you'd want to create a counter for each query that might run.

If the range of possible queries is infinite, you might want to think of aggregating counters or using them in more creative ways.

If you tell us the query you're trying to run, there might be someone who has a better idea.

Upvotes: 2

oli

Reputation: 3551

This doesn't directly address the calculation aspect of your question, but would using the count() method of a query object work for you? Or have you tried that and found it unsuitable? As per the docs, it's only slightly faster than retrieving all of the data, but on the plus side it would give you the actual number of results.

http://code.google.com/appengine/docs/python/datastore/queryclass.html#Query_count
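For example (a quick sketch with the old db API; MyModel is a placeholder name):

    # count() still iterates over the results internally, so per the
    # docs it is only slightly faster than fetching them, but it does
    # return the exact count up to the limit you pass.
    from google.appengine.ext import db

    class MyModel(db.Model):
        pass

    q = MyModel.all()
    total = q.count(limit=100000)   # exact result count, up to the limit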

Upvotes: 1
