Reputation: 562
I am writing an application that uses a remote API that serves up fairly static data (though it can still update several times a day). The problem is that the API is quite slow, and I'd much rather import that data into my own datastore anyway, so that I can actually query the data on my end as well.
The catch is that the results contain ~700 records that need to be synced every 5 hours or so. This involves adding new records, updating old records, and deleting stale ones.
I have a simple solution that works -- but it's slow as molasses, and uses 30,000 datastore read operations before it times out (after about 500 records).
The worst part about this is that the 700 records are for a single client, and I was doing it as a test. In reality, I would want to do the same thing for hundreds or thousands of clients with a similar number of records... you can see how that is not going to scale.
Here is my entity class definition:
from google.appengine.ext import ndb

class Group(ndb.Model):
    groupid = ndb.StringProperty(required=True)
    name = ndb.StringProperty(required=True)
    date_created = ndb.DateTimeProperty(required=True, auto_now_add=True)
    last_updated = ndb.DateTimeProperty(required=True, auto_now=True)
Here is my sync code (Python):
from datetime import datetime

currentTime = datetime.now()
groups = get_list_of_groups_from_api(clientid)  #[{'groupname':'Group Name','id':'12341235'}, ...]

for group in groups:
    groupid = group["id"]
    groupObj = Group.get_or_insert(groupid, groupid=group["id"], name=group["name"])
    groupObj.put()

staleGroups = Group.query(Group.last_updated < currentTime)
for staleGroup in staleGroups:
    staleGroup.key.delete()  # ndb entities are deleted via their key
Upvotes: 0
Views: 158
Reputation: 12986
I can't tell you why you are getting 30,000 read operations.
You should start by running appstats and profiling this code, to see where the datastore operations are being performed.
That being said, I can see some real inefficiencies in your code.
For instance, your code for deleting stale groups is horribly inefficient.
You should be doing a keys_only query and then a batch delete; what you are doing now is really slow, paying the full latency of a separate delete() for every entity in the loop.
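Something along these lines (an untested sketch, reusing your Group model and currentTime cutoff):

# Fetch only the keys of stale groups, then delete them in one batch.
stale_keys = Group.query(Group.last_updated < currentTime).fetch(keys_only=True)
ndb.delete_multi(stale_keys)  # one batched call instead of one RPC per entity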
Also, get_or_insert uses a transaction, and if the group didn't exist it has already performed a put, so your explicit put() afterwards is a second, redundant write. If you don't need transactions you will find things run faster. The fact that you are not storing any additional data means you could just blind-write the groups (no initial get/read at all), unless you want to preserve date_created.
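A rough sketch of that blind-write approach (untested, and note it resets date_created, since each freshly constructed entity gets a new auto_now_add value when written):

# Build all entities in memory and write them in one batch:
# no reads, no transactions, just overwrite by key.
entities = [Group(id=g["id"], groupid=g["id"], name=g["name"]) for g in groups]
ndb.put_multi(entities)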
Another way to make this faster is to do a batch get on the list of keys, and then a single batch put() for all the entities that didn't exist.
Again, this is much faster than iterating over each key one at a time.
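For example (sketch only), one batch read tells you which groups already exist, and one batch write covers the new ones while leaving date_created on existing records alone:

keys = [ndb.Key(Group, g["id"]) for g in groups]
existing = ndb.get_multi(keys)  # returns None for keys that don't exist yet
new_entities = [Group(id=g["id"], groupid=g["id"], name=g["name"])
                for g, ent in zip(groups, existing) if ent is None]
ndb.put_multi(new_entities)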
In addition, you should use a TaskQueue to run this code; you then have a 10 min processing window.
After that, further scaling can be achieved by splitting the process into two tasks. The first creates/updates the group entities. Once that completes, you start the task that deletes stale groups, passing the cutoff datetime as an argument to the next task.
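A rough sketch of that split using the deferred library (the function names are just placeholders, and this reuses the model and API helper from your question):

from datetime import datetime
from google.appengine.ext import deferred, ndb

def sync_groups(clientid):
    cutoff = datetime.now()
    groups = get_list_of_groups_from_api(clientid)
    ndb.put_multi([Group(id=g["id"], groupid=g["id"], name=g["name"])
                   for g in groups])
    # Writes are done, so chain the stale-delete step as its own task,
    # passing the cutoff time along as an argument.
    deferred.defer(delete_stale_groups, cutoff)

def delete_stale_groups(cutoff):
    stale_keys = Group.query(Group.last_updated < cutoff).fetch(keys_only=True)
    ndb.delete_multi(stale_keys)

# Kick it off from a request handler or cron job:
# deferred.defer(sync_groups, clientid)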
If you have even more entities than can be processed in this simple model then start looking at MapReduce.
But for starters just concentrate on making the job you are currently running more efficient.
Upvotes: 1