Keith

Reputation: 1133

How to update 400,000 GAE datastore entities in parallel?

I have 400,000 entities of a certain type, and I'd like to perform a simple operation on each of them (adding a property). I can't process them serially because it would take forever. I don't want to use the MapReduce library because it is complicated and overwhelming.

Basically I'd like to create 100 tasks on the taskqueue, each task taking a segment of ~4,000 entities and performing this operation on each one. Hopefully this wouldn't take more than a few minutes to process all 400k entities, when all tasks are executing in parallel.

However, I'm not sure how to use GAE queries to do this. My entities have string IDs of the form "230498234-com.example" which were generated by my application. I want each task to basically ask the datastore something like, "Please give me entities #200,000-#204,000" and then operate on them one by one.

Is this possible? How can I divide up the datastore in this way?

Upvotes: 2

Views: 437

Answers (4)

Zig Mandel

Reputation: 19835

Reading is fast, writing is slow. Unless you can do efficient queries to segment the data (hint: don't do it with offset pagination, as App Engine will walk the index all the way to your page for every page; use query cursors instead), have a single backend run one query and send the data to be processed to task queues. Each task can process 100 entities, for example. The advantage here is that you don't need to segment your data and don't need any complicated setup other than starting a single backend that creates the tasks as it reads from the single query. The new App Engine modules might be easier to use than the standard backend instances (because they won't randomly stop).

If you want to make it really robust, use a query cursor with a page size equal to the number of elements to process per task, and remember the last cursor for which you created a task. If the backend stops before it finishes, start it again and it will pick up where it left off.
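Here's a minimal sketch of that approach, assuming an ndb model called MyEntity and a worker handler mapped to /tasks/update_batch (both names are made up for this example):

from google.appengine.api import taskqueue
from google.appengine.ext import ndb

class MyEntity(ndb.Model):
    new_property = ndb.StringProperty()

BATCH_SIZE = 100

def enqueue_all():
    """Walk the whole kind with a query cursor, enqueueing one task per batch of keys."""
    cursor = None
    while True:
        keys, cursor, more = MyEntity.query().fetch_page(
            BATCH_SIZE, start_cursor=cursor, keys_only=True)
        if keys:
            taskqueue.add(url='/tasks/update_batch',
                          params={'keys': [k.urlsafe() for k in keys]})
        # Persisting cursor.urlsafe() here would let a restarted backend
        # resume from where it stopped, as suggested above.
        if not more:
            break

The worker behind /tasks/update_batch would then decode the keys and update those entities in a single batch.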

Upvotes: 4

Niklas Rosencrantz

Reputation: 26652

My first use of mapreduce was something almost exactly like this. I had to add a property to my image models and I did it like this:

In mapreduce.yaml

mapreduce:
- name: cleanimages
  mapper:
    input_reader: mapreduce.input_readers.DatastoreInputReader
    handler: main.process
    params:
    - name: entity_kind
      default: main.Image

Then you put whatever you want to happen in the process code:

from mapreduce import operation as op

def process(entity):
    # Clear the obsolete property and write the entity back via a mapper Put operation.
    entity.small = None
    yield op.db.Put(entity)

In this case I just set one of the properties to None since it was no longer used, but you can put any code you like there, creating a new property and saving the entity as above. You can find more info at the mapreduce site.

Upvotes: 0

Mars

Reputation: 1422

This is a perfect job for MapReduce (https://developers.google.com/appengine/docs/python/dataprocessing/). It may be difficult to learn at first but once mastered you'll fall in love with it.

You can also consider lazily adding the property when the entity is next saved, provided not having the property is the same as having the default value in your query.
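For example, with ndb a lazy migration can be as simple as giving the new property a default, so old entities behave as if they have it and only get it written when they are next saved (the model and property names here are hypothetical):

from google.appengine.ext import ndb

class MyEntity(ndb.Model):
    name = ndb.StringProperty()
    # Newly added property; entities written before it existed simply lack it
    # in the datastore and will read back as the default.
    new_property = ndb.StringProperty(default='some-default')

def resave(key):
    """Re-saving an entity writes new_property, so queries filtering on it start matching."""
    entity = key.get()
    if entity is not None:
        entity.put()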

Upvotes: 5

Isaac

Reputation: 788

A task master can run the query and post cursors (using 'end cursors') to a task queue, each corresponding to about 1,000 results, rather than fetching the results itself. Note that there's no guarantee the workers will see exactly the same query results when executing on the cursor, but this is probably good enough. An alternative with more guarantees would be to perform a keys-only query on the task master, actually fetch the results (the keys), and post groups of 1,000 keys to the task queue. Workers can then use a multi-get to retrieve the entities with stronger consistency guarantees.
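A rough sketch of the keys-only variant, again assuming a hypothetical MyEntity model and a worker handler at /tasks/update_batch:

from google.appengine.api import taskqueue
from google.appengine.ext import ndb

class MyEntity(ndb.Model):
    new_property = ndb.StringProperty()

def master():
    """Fetch keys only, then post them to the task queue in groups of 1,000."""
    keys = MyEntity.query().fetch(keys_only=True)
    for i in range(0, len(keys), 1000):
        batch = keys[i:i + 1000]
        taskqueue.add(url='/tasks/update_batch',
                      params={'keys': [k.urlsafe() for k in batch]})

def worker(urlsafe_keys):
    """Multi-get the entities, set the new property, and write them back in one batch."""
    entities = ndb.get_multi([ndb.Key(urlsafe=k) for k in urlsafe_keys])
    for entity in entities:
        if entity is not None:
            entity.new_property = 'value'  # whatever your migration needs to set
    ndb.put_multi([e for e in entities if e is not None])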

Upvotes: 0
