Vik

Reputation: 9289

Mass update queues

I see a very common requirement to update a field or value across all the rows of a particular entity type, and it usually crosses the 10 min task queue limit.

So, what is the best way to run a cron job using task queues that can finish updating all the rows?

One approach I tried was firing a query in the cron job and then creating multiple lists of IDs of equal size, say 100 IDs per list, and then spawning one task per list by passing it the ID list. Then, in the task code, getting each entity row using pm.getObjectId and processing it.

I still find this approach a bit manual and not intelligent. Any better ways to handle it?

Upvotes: 1

Views: 135

Answers (2)

Peter Knego

Reputation: 80340

I use this to update millions of records within the task queue's 10 min limit:

  1. Create a loop where each iteration runs a query with a cursor (do not use offset()). In each iteration use the cursor returned by the previous one; this way you efficiently walk the whole range of targeted entities. Use limit(1000) to fetch a batch of 1000 entities each time, and also set the prefetch size to 1000 to minimize network roundtrips.

  2. For each batch, update the properties and then do an async put (a sketch of this loop follows the list).
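
A minimal sketch of that loop using the low-level Datastore API, assuming an entity kind "Item" and a placeholder property "processed" to update; in a real job you would also want to keep and check the Futures returned by the async puts:

    import com.google.appengine.api.datastore.*;

    import java.util.List;
    import java.util.concurrent.Future;

    public class MassUpdater {
        public static void updateAll() {
            DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
            AsyncDatastoreService asyncDs = DatastoreServiceFactory.getAsyncDatastoreService();

            Cursor cursor = null;
            while (true) {
                // Batch of 1000 with prefetch size 1000, as described above.
                FetchOptions opts = FetchOptions.Builder.withLimit(1000).prefetchSize(1000);
                if (cursor != null) {
                    opts = opts.startCursor(cursor);  // continue where the last batch ended
                }
                QueryResultList<Entity> batch =
                        ds.prepare(new Query("Item")).asQueryResultList(opts);
                if (batch.isEmpty()) {
                    break;  // the whole range has been walked
                }
                for (Entity e : batch) {
                    e.setProperty("processed", true);  // placeholder property update
                }
                // Async put: fire off the write and move straight on to the next batch.
                Future<List<Key>> pending = asyncDs.put(batch);
                cursor = batch.getCursor();
            }
        }
    }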

Upvotes: 1

Ajax

Reputation: 2520

If you have cash to burn, use a backend; backends have no request time limit (though using a backend to process a single large request is wasteful... only consider this if you have other work you can offload to it).

More likely, what you really want to do is sharding. That is, breaking up one big linear task into a bunch of smaller, parallelizable tasks.

One common pattern I use a lot is to have one request just do dispatching... That is, query for the work you need to do, collect a list of keys to operate on, and fire off batches of work with, say, 100 tasks at a time (send along as much data as you can scrape to avoid re-querying if you don't need to).

This way, only the dispatcher has to navigate the complete dataset, without performing any time-consuming updates, and so long as it takes less than 10 minutes, you should be golden.
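
A sketch of such a dispatcher under those assumptions, using a keys-only query and the task queue API; the kind "Item", the "/worker" URL, the "key" parameter name, and the batch size of 100 are all placeholders:

    import com.google.appengine.api.datastore.*;
    import com.google.appengine.api.taskqueue.Queue;
    import com.google.appengine.api.taskqueue.QueueFactory;
    import com.google.appengine.api.taskqueue.TaskOptions;

    import java.util.ArrayList;
    import java.util.List;

    public class Dispatcher {
        private static final int BATCH_SIZE = 100;  // assumed keys-per-task batch size

        public static void dispatch() {
            DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
            Queue queue = QueueFactory.getDefaultQueue();

            // Keys-only query keeps the dispatcher cheap: no entity payloads are fetched.
            Query q = new Query("Item").setKeysOnly();
            List<Key> batch = new ArrayList<Key>(BATCH_SIZE);
            for (Entity e : ds.prepare(q).asIterable()) {
                batch.add(e.getKey());
                if (batch.size() == BATCH_SIZE) {
                    enqueue(queue, batch);
                    batch = new ArrayList<Key>(BATCH_SIZE);
                }
            }
            if (!batch.isEmpty()) {
                enqueue(queue, batch);  // leftover partial batch
            }
        }

        private static void enqueue(Queue queue, List<Key> keys) {
            TaskOptions task = TaskOptions.Builder.withUrl("/worker");
            for (Key k : keys) {
                task = task.param("key", KeyFactory.keyToString(k));  // websafe key string
            }
            queue.add(task);
        }
    }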

Now, depending on your entity ancestor setup, you might run into contention when trying to update thousands of entities in parallel (which can happen if your dispatcher is too fast). The simple solution is to set .withCountdownMillis(latency += 1000) to give each request about a second of breathing room (maybe more, depending on the size of your entities and the number of indexes on each one). Benchmark your app with Appstats to see how long each actually takes, and give them an extra 500 ms or so to cover the standard deviation.
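
A sketch of that staggering, assuming the dispatcher has already produced batches of websafe key strings and that roughly one second of extra delay per task is enough for your entities:

    import com.google.appengine.api.taskqueue.Queue;
    import com.google.appengine.api.taskqueue.QueueFactory;
    import com.google.appengine.api.taskqueue.TaskOptions;

    import java.util.List;

    public class StaggeredDispatch {
        public static void enqueueStaggered(List<List<String>> keyBatches) {
            Queue queue = QueueFactory.getDefaultQueue();
            long latency = 0;
            for (List<String> keys : keyBatches) {
                TaskOptions task = TaskOptions.Builder
                        .withUrl("/worker")                 // placeholder worker URL
                        .countdownMillis(latency += 1000);  // ~1 s more delay per task
                for (String key : keys) {
                    task = task.param("key", key);
                }
                queue.add(task);
            }
        }
    }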

Now... I also have to wonder how many entities you are working with that 10 minutes isn't long enough... Are you using the asynchronous API? How about batching requests? If you are operating on one entity at a time, and blocking on a get/put per entity, you will easily hit the limit.

Instead, look into asynchronous requests. Using async, I am able to fire off a put, stash the Future, fire off a bunch more, and by the time I resolve the Future, the operation has already completed, so I pay essentially zero milliseconds of wall time blocking on requests.
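
A sketch of that pattern with the low-level AsyncDatastoreService; how the entities are built and updated is up to you:

    import com.google.appengine.api.datastore.AsyncDatastoreService;
    import com.google.appengine.api.datastore.DatastoreServiceFactory;
    import com.google.appengine.api.datastore.Entity;
    import com.google.appengine.api.datastore.Key;

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Future;

    public class AsyncPuts {
        public static void putAll(List<Entity> entities) throws Exception {
            AsyncDatastoreService asyncDs = DatastoreServiceFactory.getAsyncDatastoreService();

            // Fire off every put without blocking; each call returns immediately.
            List<Future<Key>> pending = new ArrayList<Future<Key>>();
            for (Entity e : entities) {
                pending.add(asyncDs.put(e));
            }
            // By the time the Futures are resolved, most of the work has already
            // finished, so very little wall time is spent blocking here.
            for (Future<Key> f : pending) {
                f.get();
            }
        }
    }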

Even if you can't use the low-level async API (still highly recommended), consider at least using batches. That is, instead of putting one entity at a time, add them to a list and do a put + clear every 50 entities or so (more if they are small). This lets the App Engine backend parallelize all fifty, so you pay roughly the time of one request plus the per-entity serialization overhead.
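
A sketch of the batched variant, flushing every 50 entities; the property being set is again a placeholder:

    import com.google.appengine.api.datastore.DatastoreService;
    import com.google.appengine.api.datastore.DatastoreServiceFactory;
    import com.google.appengine.api.datastore.Entity;

    import java.util.ArrayList;
    import java.util.List;

    public class BatchedPuts {
        private static final int BATCH_SIZE = 50;  // flush every 50 entities, as above

        public static void updateInBatches(Iterable<Entity> entities) {
            DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
            List<Entity> buffer = new ArrayList<Entity>(BATCH_SIZE);
            for (Entity e : entities) {
                e.setProperty("processed", true);  // placeholder property update
                buffer.add(e);
                if (buffer.size() == BATCH_SIZE) {
                    ds.put(buffer);   // one round trip writes the whole batch
                    buffer.clear();
                }
            }
            if (!buffer.isEmpty()) {
                ds.put(buffer);       // leftover partial batch
            }
        }
    }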

Combining both async and batching with non-contentious entities, I am generally able to process roughly 4000 entities a minute. And if you have to do 40,000+ entities, then you need to look into proper sharding. To do so, grab one key every (arbitrarily chosen) 1000 entities, and launch a task which queries from the previous key (or null) to that key. This lets you run over as many entities as you please in a short time by taking one big job and turning it into many smaller jobs.
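
A sketch of that key-range sharding, assuming a hypothetical "/range-worker" handler that processes one range of keys per task:

    import com.google.appengine.api.datastore.*;
    import com.google.appengine.api.datastore.Query.FilterOperator;
    import com.google.appengine.api.taskqueue.Queue;
    import com.google.appengine.api.taskqueue.QueueFactory;
    import com.google.appengine.api.taskqueue.TaskOptions;

    public class RangeSharding {
        // Dispatcher: sample every 1000th key and enqueue one task per key range.
        public static void shard(String kind) {
            DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
            Queue queue = QueueFactory.getDefaultQueue();

            Query q = new Query(kind).setKeysOnly().addSort(Entity.KEY_RESERVED_PROPERTY);
            Key previous = null;
            int count = 0;
            for (Entity e : ds.prepare(q).asIterable()) {
                if (++count % 1000 == 0) {
                    enqueueRange(queue, previous, e.getKey());
                    previous = e.getKey();
                }
            }
            enqueueRange(queue, previous, null);  // tail of the key space
        }

        private static void enqueueRange(Queue queue, Key from, Key to) {
            TaskOptions task = TaskOptions.Builder.withUrl("/range-worker");
            if (from != null) {
                task = task.param("from", KeyFactory.keyToString(from));
            }
            if (to != null) {
                task = task.param("to", KeyFactory.keyToString(to));
            }
            queue.add(task);
        }

        // Worker side: iterate the half-open key range (from, to] using __key__ filters.
        public static Iterable<Entity> queryRange(DatastoreService ds, String kind, Key from, Key to) {
            Query q = new Query(kind);
            if (from != null) {
                q.addFilter(Entity.KEY_RESERVED_PROPERTY, FilterOperator.GREATER_THAN, from);
            }
            if (to != null) {
                q.addFilter(Entity.KEY_RESERVED_PROPERTY, FilterOperator.LESS_THAN_OR_EQUAL, to);
            }
            return ds.prepare(q).asIterable();
        }
    }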

Upvotes: 2
