Reputation: 19675
I'd like to load lots of data from a Google Datastore table. For performance, I'd like to run, in parallel, a few queries that each load a lot of objects. Cursors are not suitable for parallel execution.
QuerySplitter is. However, with QuerySplitter you have to say how many splits you want, whereas what I care about is loading a certain number of objects per query. The number is chosen for the needs of my application, large but not too large, say 800 objects. It's OK if the number of objects returned by each query is only very roughly the same; nothing worse would happen than different threads running for different amounts of time.
How do I do this? I could query all objects keys-only in order to count them, and divide by 800. Is there a better way?
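For reference, here's a minimal sketch of that keys-only idea using the google-cloud-datastore Python client (the kind name and the 800 target are placeholders):

```python
from google.cloud import datastore

client = datastore.Client()

# Keys-only query: the Datastore returns just the keys, which is cheaper
# than fetching full entities but still iterates the whole kind.
query = client.query(kind="MyKind")
query.keys_only()
total = sum(1 for _ in query.fetch())

# Aim for roughly 800 objects per parallel query.
num_splits = max(1, total // 800)
```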
Upvotes: 1
Views: 244
Reputation: 3564
Querying all your entities (even keys-only) might not scale so well, but you could run your query (or queries) periodically and save the counts in datastore or memcache, depending on how frequently you need to run your job; see the sketch below.
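A rough sketch of that caching approach, assuming a hypothetical `KindCount` kind as the cache slot:

```python
import datetime
from google.cloud import datastore

client = datastore.Client()

def refresh_count(kind="MyKind"):
    # Run this from a cron job / scheduled task, not on the hot path.
    query = client.query(kind=kind)
    query.keys_only()
    count = sum(1 for _ in query.fetch())

    # "KindCount" is a hypothetical kind used purely as a cache slot;
    # readers just client.get() this entity instead of re-counting.
    entity = datastore.Entity(client.key("KindCount", kind))
    entity.update({"count": count, "updated": datetime.datetime.utcnow()})
    client.put(entity)
    return count
```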
However, to get the total number of entities of a given kind you can use the Datastore Statistics API, which should be a lot quicker. I don't know how frequently the stats are updated, but it's probably the same as the stats in the console.
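The kind statistics are exposed as special `__Stat_Kind__` entities, so you can query them like any other kind; something like this (Python client, kind name assumed):

```python
from google.cloud import datastore

client = datastore.Client()

# Per-kind statistics live in special __Stat_Kind__ entities whose
# "count" property is the number of entities of that kind.
query = client.query(kind="__Stat_Kind__")
query.add_filter("kind_name", "=", "MyKind")

stats = list(query.fetch(limit=1))
count = stats[0]["count"] if stats else 0
```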
If you are going to need more frequent counts, or figures for filtered queries, you might consider sharded counters. Since you only need an approximate number, you could update them asynchronously on each new put.
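A minimal sharded-counter sketch (the `CounterShard` kind, shard count, and helper names are all illustrative, not a library API):

```python
import random
from google.cloud import datastore

client = datastore.Client()
NUM_SHARDS = 20  # more shards = more write throughput, slightly costlier reads

def increment(counter_name="MyKind"):
    # Pick a random shard so concurrent writers contend on different entities.
    shard_id = f"{counter_name}-{random.randrange(NUM_SHARDS)}"
    key = client.key("CounterShard", shard_id)
    with client.transaction():
        shard = client.get(key) or datastore.Entity(key=key)
        shard["count"] = shard.get("count", 0) + 1
        client.put(shard)

def approximate_total(counter_name="MyKind"):
    # Sum over all shards; with ~20 shards this is one small query.
    query = client.query(kind="CounterShard")
    return sum(e["count"] for e in query.fetch()
               if e.key.name.startswith(counter_name + "-"))
```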
Upvotes: 1