How to minimize datastore writes initiated by the mapreduce library?

I've got 3 parts to this question:

I have an application where users create objects that other users can update within 5 minutes. After 5 minutes, the objects time out and are invalid. I'm storing the objects as entities. To handle the timeout, I have a cron job that runs once a minute and starts a mapreduce job to clear out the expired objects.

Most of the time right now, I don't have any active objects. In this case, the mapreduce handler checks the entity it gets, and does nothing if it's not active, no writes. However, my free datastore write quota is running out from the mapreduce calls after about 7 hours. According to my rough estimate, it looks like just running mapreduce causes ~ 120 writes/call. (Rough math, 60 calls/hr * 7 hr = 420 calls, 50k ops limit / 420 calls ~ 120 writes/call)
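
The rough estimate above can be reproduced as a quick calculation. This is only a sketch of the arithmetic, using the figures quoted in the question (50k write ops, one cron run per minute, quota gone after about 7 hours); the variable names are illustrative.

```python
# Back-of-the-envelope estimate of datastore writes per mapreduce run.
# All three inputs are the figures quoted in the question, not measured values.
FREE_WRITE_QUOTA = 50000       # free datastore write ops
CALLS_PER_HOUR = 60            # cron fires once a minute
HOURS_UNTIL_EXHAUSTED = 7      # observed time until the quota runs out

calls = CALLS_PER_HOUR * HOURS_UNTIL_EXHAUSTED
writes_per_call = FREE_WRITE_QUOTA / float(calls)
```

That works out to 420 calls and roughly 119 writes per call, consistent with the ~120 figure.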

Q1: Can anyone verify that just running mapreduce triggers ~120 datastore writes?

To get around it, I'm checking the datastore before I kick off the mapreduce:

def cronhandler():
    # Cheap keys-only count; skip mapreduce entirely when there is nothing to process.
    count = model.all(keys_only=True).count(limit=1000)
    if count:
        # One shard per 100 entities, with a minimum of one shard.
        shards = (count // 100) + 1
        from mapreduce import control
        control.start_map("Timeout open objects",
                          "expire.maphandler",
                          "expire.OpenOrderInputReader",
                          {'entity_kind': 'model'},
                          shard_count=shards)
    return HttpResponse()

Q2: Is this the best way to avoid the mapreduce-induced datastore writes? Is there a better way to configure mapreduce to avoid extraneous writes? I was thinking it might be possible with a better custom InputReader.

Q3: I'm guessing more shards result in more extraneous datastore writes from mapreduce bookkeeping. Is it appropriate to limit the shard count based on the expected number of objects I need to write?

Upvotes: 4

Answers (3)

Kyle Finley

Reputation: 12002

This doesn't exactly answer your question, but could you reduce the frequency of the cron job?

Instead of deleting models as soon as they become invalid, simply remove them from the queries that your Users see.

For example:

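import datetime

# Only return objects created within the last five minutes.
# Assumes the model has a created_at timestamp property.
now = datetime.datetime.now()
five_minutes_ago = now - datetime.timedelta(minutes=5)
q = model.all()
q.filter('created_at >=', five_minutes_ago)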

Or if you don't want to use an inequality filter you could use == based on five minute blocks.
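
A minimal sketch of the five-minute-block idea, assuming each entity stores an integer block number at creation time (a hypothetical property, not something in the original model). The helper names `block_id` and `candidate_blocks` are made up for illustration.

```python
import datetime

BLOCK_SECONDS = 5 * 60  # width of one block, in seconds
EPOCH = datetime.datetime(1970, 1, 1)

def block_id(ts):
    """Return an integer identifying the five-minute block that contains ts."""
    return int((ts - EPOCH).total_seconds()) // BLOCK_SECONDS

def candidate_blocks(now):
    """Blocks that can still contain objects younger than five minutes."""
    current = block_id(now)
    # An object created late in the previous block may still be valid.
    return [current - 1, current]
```

Each candidate block can then be matched with an equality filter on the stored block number. Objects from the previous block still need their creation time checked, since only some of them are younger than five minutes.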

Then, you run your cron every hour or so to clean out the inactive models.

The downside to this approach is that expired entities would still be returned by a key-only fetch, in which case you would need to verify that they are still valid before returning them to the user.
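
That validity check could look something like the following. It is a sketch only: `is_still_valid` is a hypothetical helper, and it assumes the entity's creation time is available as a naive `datetime` comparable with `datetime.datetime.now()`.

```python
import datetime

TIMEOUT = datetime.timedelta(minutes=5)

def is_still_valid(created_at, now=None):
    """Return True if an object created at created_at has not timed out yet."""
    if now is None:
        now = datetime.datetime.now()
    # Same boundary as the 'created_at >= five_minutes_ago' query filter.
    return now - created_at <= TIMEOUT
```

Passing `now` explicitly makes the check easy to test and keeps a batch of lookups consistent with each other.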

Upvotes: 1

dragonx

Reputation: 15143

I'm assuming what I've done is the best way to go about doing things. It looks like the Mapreduce API uses the datastore to keep track of the jobs launched and synchronize workers. By default the API uses 8 workers. Reducing the number of workers reduces the number of datastore writes, but that reduces wall time performance as well.

Upvotes: 0

rbanffy

Reputation: 2521

What if you kept your objects on memcache instead of the datastore? My only worry is whether a memcache is consistent between all instances running a given application, but, if it is, the problem has a very neat solution.
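
To illustrate why this is neat: a cache entry with an expiry time disappears on its own, so no cleanup job is needed. On App Engine, `memcache.set(key, value, time=300)` takes an expiry in seconds. The class below is only an in-process stand-in to show the behaviour (`ExpiringStore` is a made-up name, not part of any library), and it does not model the fact that a real cache may evict entries earlier under memory pressure.

```python
import time

class ExpiringStore(object):
    """Tiny in-process stand-in for a cache with per-key expiry (illustrative only)."""

    def __init__(self):
        self._items = {}  # key -> (value, expires_at)

    def set(self, key, value, ttl, now=None):
        """Store value under key for ttl seconds."""
        now = time.time() if now is None else now
        self._items[key] = (value, now + ttl)

    def get(self, key, now=None):
        """Return the stored value, or None if it is missing or has expired."""
        now = time.time() if now is None else now
        entry = self._items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if now >= expires_at:
            del self._items[key]  # lazily drop the expired entry
            return None
        return value
```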

Upvotes: 2
