Chris Dutrow

Reputation: 50362

Estimating number of results in Google App Engine Query

I'm attempting to estimate the total number of results for App Engine queries that will return large numbers of results.

In order to do this, I assigned a random floating-point number between 0 and 1 to every entity. Then I executed the query whose total results I wanted to estimate with the following three settings:

 * I ordered by the random numbers that I had assigned in ascending order
 * I set the offset to 1000
 * I fetched only one entity

I then plugged the entity's random value that I had assigned for this purpose into the following equation to estimate the total results (since I used 1000 as the offset above, the value of OFFSET would be 1000 in this case):

(1 / RANDOM) * OFFSET

The idea is that since each entity has a random number assigned to it, and I am sorting by that random number, the random value of the entity at a given offset should be roughly proportional to that offset's position within the full result set; that is, the entity at offset 1000 should have a random value of approximately 1000 / TOTAL, which is what makes the formula above work.
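In code, the approach looks roughly like this (a minimal sketch against the old google.appengine.ext.db API; the Item model and its rand property are placeholder names):

    # Minimal sketch of the sampling trick described above, using the
    # old db API. Item and rand are placeholder names.
    import random
    from google.appengine.ext import db

    OFFSET = 1000

    class Item(db.Model):
        # Assigned once at creation time from random.random().
        rand = db.FloatProperty()

    def estimate_total():
        q = Item.all().order('rand')        # ascending by the random value
        sample = q.fetch(1, offset=OFFSET)  # only the entity at OFFSET
        if not sample:
            return None                     # fewer than OFFSET entities exist
        # If rand is uniform on [0, 1), the entity at position OFFSET
        # should have rand ~= OFFSET / TOTAL, so TOTAL ~= OFFSET / rand.
        return OFFSET / sample[0].rand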

The problem I am having is that the estimates I am getting are too low, and they get lower as the offset decreases. I had anticipated that the lower the offset I used, the less accurate the estimate would be, but I thought the margin of error would fall both above and below the actual number of results.

Below is a chart demonstrating what I am talking about. As you can see, the predictions get more consistent (accurate) as the offset increases from 1000 to 5000. But the predictions then settle onto a fourth-degree polynomial (y = -5E-15*x^4 + 7E-10*x^3 - 3E-05*x^2 + 0.3781*x + 51608).

Am I making a mistake here, or does the standard Python random number generator not distribute numbers evenly enough for this purpose?

Thanks!

[chart: estimated total results plotted against offset]

Edit:

It turns out that this problem was due to a mistake of mine. In another part of the program, I was grabbing entities from the beginning of the series, performing an operation on them, and then re-assigning their random numbers. This resulted in a denser distribution of random numbers towards the end of the range.
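Here is a toy simulation of that churn (my reconstruction of the bug, with made-up sizes) showing how repeatedly re-assigning the smallest random values skews the distribution upward, which in turn drives OFFSET / RANDOM below the true total:

    # Toy simulation: repeatedly take the entity with the smallest
    # random value, "process" it, and give it a fresh random value.
    # Low values get recycled quickly while high values persist, so
    # density piles up near 1.0 and the estimate comes out low.
    import heapq
    import random

    N = 10000
    values = [random.random() for _ in range(N)]
    heapq.heapify(values)

    for _ in range(5 * N):
        # Pop the current minimum and replace it with a fresh draw.
        heapq.heapreplace(values, random.random())

    values.sort()
    k = 1000
    print('mean: %.3f (uniform would be ~0.5)' % (sum(values) / N))
    print('value at rank %d: %.4f (uniform would be ~%.4f)'
          % (k, values[k], float(k) / N))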

I did a little more digging into this concept, fixed the problem, and tried it again on a different query (so the number of results is different from above). I found that this idea can be used to estimate the total results for a query. One thing of note is that the "error" is very similar for offsets that are close together. When I made a scatter chart in Excel, I expected the accuracy of the predictions at each offset to form a "cloud": offsets at the very beginning would produce a larger, less dense cloud that would converge to a very tiny, dense cloud around the actual value as the offsets got larger. This is not what happened, as you can see below in the chart of how far off the predictions were at each offset. Where I thought there would be a cloud of dots, there is a line instead.

[chart: prediction error at each offset]

This is a chart of the maximum error after each offset. For example, the maximum error for any offset after 10000 was less than 1%:

[chart: maximum error beyond each offset]

Upvotes: 4

Views: 314

Answers (3)

lucemia

Reputation: 6617

Some quick thoughts:

  1. Have you tried the Datastore Statistics API? It may provide fast and accurate results if you don't update your entity set very frequently. http://code.google.com/appengine/docs/python/datastore/stats.html

    [EDIT1.]

  2. I did some math, and I think the estimation method you proposed here can be rephrased as an "order statistic" problem. http://en.wikipedia.org/wiki/Order_statistic#The_order_statistics_of_the_uniform_distribution

For example:

If the actual number of entities is 60000, the question is equivalent to: "What is the probability that your 1000th [2000th, 3000th, ...] sample falls in an interval [l, u] such that the estimated total number of entities based on that sample is within an acceptable error of 60000?"

If the acceptable error is 5%, the interval [l, u] will be [0.015873015873015872, 0.017543859649122806]. I think the probability won't be very large.
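One way to check that probability: the k-th order statistic of n i.i.d. Uniform(0, 1) samples follows a Beta(k, n - k + 1) distribution, so (assuming scipy is available) the calculation is short:

    # Probability that the estimate based on the k-th sample lands
    # within 5% of the true total n, using the Beta distribution of
    # the k-th order statistic of n uniform samples.
    from scipy.stats import beta

    n = 60000    # actual number of entities
    k = 1000     # offset of the sampled entity
    err = 0.05   # acceptable relative error

    # The estimate k / x lies within [n*(1-err), n*(1+err)] exactly
    # when x lies within [k / (n*(1+err)), k / (n*(1-err))].
    l = k / (n * (1 + err))   # = 0.015873...
    u = k / (n * (1 - err))   # = 0.017543...

    p = beta.cdf(u, k, n - k + 1) - beta.cdf(l, k, n - k + 1)
    print('P(within %d%%) = %.3f' % (err * 100, p))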

Upvotes: 1

Sudhir Jonathan

Reputation: 17516

When using GAE it makes a lot more sense not to try to do large amounts of work on reads - it's built and optimized for very fast request turnarounds. In this case it's actually more efficient to maintain a count of your results as and when you create the entities.

If you have a standard query, this is fairly easy - just use a sharded counter when creating the entities, as in the sketch below. You can seed it using a MapReduce job to get the initial count.
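A minimal sketch of such a counter (illustrative names, following the classic App Engine sharding recipe with the old db API):

    # Sharded counter sketch: writes spread across NUM_SHARDS entities
    # so concurrent increments rarely contend on one entity group.
    import random
    from google.appengine.ext import db

    NUM_SHARDS = 20

    class CounterShard(db.Model):
        name = db.StringProperty(required=True)
        count = db.IntegerProperty(default=0)

    def increment(name):
        # Pick a shard at random and bump it inside a transaction.
        index = random.randint(0, NUM_SHARDS - 1)
        def txn():
            key_name = '%s_%d' % (name, index)
            shard = CounterShard.get_by_key_name(key_name)
            if shard is None:
                shard = CounterShard(key_name=key_name, name=name)
            shard.count += 1
            shard.put()
        db.run_in_transaction(txn)

    def get_count(name):
        # Reading is a single small query that sums all shards.
        return sum(s.count for s in
                   CounterShard.all().filter('name =', name))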

If you have queries that might be dynamic, this is more difficult. If you know the range of possible queries that you might perform, you'd want to create a counter for each query that might run.

If the range of possible queries is infinite, you might want to think of aggregating counters or using them in more creative ways.

If you tell us the query you're trying to run, there might be someone who has a better idea.

Upvotes: 2

oli

Reputation: 3551

This doesn't directly address the calculation aspect of your question, but would using the count() method of a query object work for you? Or have you tried that and found it unsuitable? As per the docs, it's only slightly faster than retrieving all of the data, but on the plus side it would give you the actual number of results.

http://code.google.com/appengine/docs/python/datastore/queryclass.html#Query_count
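For example (a quick sketch with the old db API; MyModel is a placeholder name):

    # count() still iterates over the results internally, so per the
    # docs it is only slightly faster than fetching them, but it does
    # return the exact count up to the limit you pass.
    from google.appengine.ext import db

    class MyModel(db.Model):
        pass

    q = MyModel.all()
    total = q.count(limit=100000)   # exact result count, up to the limit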

Upvotes: 1
