Grey Panther

Reputation: 13118

How to periodically update a moderate amount of data (~2.5m entries) in Google Datastore?

I'm trying to do the following periodically (let's say once a week):

Synchronization can mean that some entries are updated, others are deleted (if they were removed from the public datasets), or new entries are created.

I've put together a Python script using google-cloud-datastore; however, the performance is abysmal: it takes around 10 hours (!) to do this. What I'm doing:

I already batch the requests (using .put_multi, .delete_multi, etc.).
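Roughly what the batched writes look like (the chunk size of 500 matches Datastore's per-commit entity limit; the helper names are illustrative):

```python
from google.cloud import datastore

client = datastore.Client()

BATCH = 500  # Datastore caps a single commit at 500 entities

def chunked(items, size=BATCH):
    """Yield successive fixed-size slices of a list."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def apply_changes(to_upsert, to_delete):
    """to_upsert: list of datastore.Entity; to_delete: list of datastore.Key."""
    for batch in chunked(to_upsert):
        client.put_multi(batch)
    for batch in chunked(to_delete):
        client.delete_multi(batch)
```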

Some things I considered:

How could I improve the performance?

Upvotes: 1

Views: 113

Answers (2)

Jim Morrison

Reputation: 2887

If you use Dataflow, instead of loading in your entire dictionary you could first import the dictionary into a new project (a clean Datastore database). Then, in your Dataflow function, look up the key handed to you by Dataflow in the clean project. If the lookup returns a value, upsert it into your production project; if it doesn't exist, delete the value from your production project.
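A rough sketch of what that Dataflow step could look like with the Apache Beam Python SDK (the project IDs, the `Record` kind, and the inline key list are placeholders; a real pipeline would read the keys from your dataset):

```python
import apache_beam as beam
from google.cloud import datastore

CLEAN_PROJECT = "my-clean-project"  # placeholder project IDs
PROD_PROJECT = "my-prod-project"

class SyncFromClean(beam.DoFn):
    """Mirror one key from the clean database into production:
    upsert if it exists there, delete from production if it doesn't."""

    def setup(self):
        self.clean = datastore.Client(project=CLEAN_PROJECT)
        self.prod = datastore.Client(project=PROD_PROJECT)

    def process(self, key_path):
        kind, name = key_path
        entity = self.clean.get(self.clean.key(kind, name))
        if entity is not None:
            mirrored = datastore.Entity(key=self.prod.key(kind, name))
            mirrored.update(entity)  # Entity behaves like a dict
            self.prod.put(mirrored)
        else:
            self.prod.delete(self.prod.key(kind, name))

with beam.Pipeline() as pipeline:
    (pipeline
     | "Keys" >> beam.Create([("Record", "id-1"), ("Record", "id-2")])
     | "Sync" >> beam.ParDo(SyncFromClean()))
```

In practice you would also batch the puts and deletes (for example, accumulating them and flushing in finish_bundle) rather than issuing one RPC per element as this sketch does.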

Upvotes: 0

TheAddonDepot

Reputation: 8964

Assuming your datastore entities are only updated during the sync, you should be able to eliminate the "iterate over the entries from the datastore" step and instead store the entity keys directly in your dictionary. Then, whenever an update or delete is necessary, just reference the appropriate entity key stored in the dictionary.
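A sketch of that idea, assuming the dictionary maps a source ID straight to its Datastore key (the `Record` kind and the `known_keys` structure are assumptions):

```python
from google.cloud import datastore

client = datastore.Client()

# Hypothetical dictionary kept between syncs: each source ID maps directly
# to its Datastore key, so no read-back pass over the datastore is needed
# to find what to update or delete.
known_keys = {}  # e.g. {"abc123": client.key("Record", "abc123")}

def plan_sync(fresh_data):
    """fresh_data: source_id -> property dict parsed from the public datasets."""
    upserts, deletes = [], []

    for source_id, props in fresh_data.items():
        key = known_keys.get(source_id) or client.key("Record", source_id)
        entity = datastore.Entity(key=key)
        entity.update(props)
        upserts.append(entity)
        known_keys[source_id] = key

    # Whatever disappeared from the public datasets gets deleted.
    for source_id in set(known_keys) - set(fresh_data):
        deletes.append(known_keys.pop(source_id))

    return upserts, deletes  # hand these to batched put_multi / delete_multi
```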

You might be able to leverage multiple threads if you pre-generate empty entities (or keys) and store cursors at a given interval (say, every 100,000 entities). There's probably some overhead involved, as you'll have to build a custom system to manage and track those cursors.
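One way this could look, simplified by pre-generating keys with sequential numeric IDs so each thread owns a contiguous range instead of tracking cursors (the shard size, kind name, and payload are all placeholders):

```python
from concurrent.futures import ThreadPoolExecutor
from google.cloud import datastore

TOTAL = 2_500_000   # roughly the number of entries in the question
SHARD = 100_000     # one shard per task, mirroring the cursor interval above
BATCH = 500         # Datastore's per-commit entity limit

def write_shard(start):
    client = datastore.Client()  # a separate client keeps each worker independent
    for i in range(start, min(start + SHARD, TOTAL), BATCH):
        entities = []
        for n in range(i, min(i + BATCH, TOTAL)):
            entity = datastore.Entity(key=client.key("Record", n + 1))
            entity.update({"synced": True})  # placeholder payload
            entities.append(entity)
        client.put_multi(entities)

with ThreadPoolExecutor(max_workers=8) as pool:
    # 25 shard tasks spread across 8 threads; list() surfaces any exceptions
    list(pool.map(write_shard, range(0, TOTAL, SHARD)))
```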

Upvotes: 0
