Dave Kincaid

Reputation: 4190

Replicating data from GAE data store

We have an application that we're deploying on GAE. I've been tasked with coming up with options for replicating the data that we're storing in the GAE data store to a system running in Amazon's cloud.

Ideally we could do this without having to transfer the entire data store on every sync. The replication does not need to be in anything close to real time, so something like a once or twice a day sync would work just fine.

Can anyone with some experience with GAE help me out here with what the options might be? So far I've come up with:

  1. Use the Google-provided bulkloader.py to export the data to CSV, transfer the CSV to Amazon somehow, and process it there.

  2. Create a Java app that runs on GAE, reads the data from the data store and sends the data to another Java app running on Amazon.
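For option 1, the Amazon-side "process there" step could be as simple as upserting each exported row into a local database keyed by the entity key, so a repeated daily sync replaces rows instead of duplicating them. A minimal sketch using SQLite; the CSV columns (`key`, `name`, `value`) are illustrative assumptions, since the real columns depend on your bulkloader configuration:

```python
import csv
import io
import sqlite3

# Hypothetical CSV export with a key column plus two properties;
# the real columns depend on your bulkloader.yaml configuration.
CSV_EXPORT = """key,name,value
agtzfmRlbW8tYXBwcgsLEgRJdGVtGAEM,widget,42
agtzfmRlbW8tYXBwcgsLEgRJdGVtGAIM,gadget,7
"""

def import_csv(conn, csv_text):
    """Upsert each exported row by entity key, so re-running a daily
    sync replaces existing rows instead of duplicating them."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items "
        "(key TEXT PRIMARY KEY, name TEXT, value INTEGER)"
    )
    for row in csv.DictReader(io.StringIO(csv_text)):
        conn.execute(
            "INSERT OR REPLACE INTO items (key, name, value) VALUES (?, ?, ?)",
            (row["key"], row["name"], int(row["value"])),
        )
    conn.commit()

conn = sqlite3.connect(":memory:")
import_csv(conn, CSV_EXPORT)
import_csv(conn, CSV_EXPORT)  # second sync: same keys, so no duplicates
print(conn.execute("SELECT COUNT(*) FROM items").fetchone()[0])
```

Keying on the datastore entity key is what makes the sync idempotent, which matters once you move to incremental exports.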

Do those options work? What would be the gotchas with those? What other options are there?

Upvotes: 3

Views: 400

Answers (1)

proppy

Reputation: 10504

You could use logic similar to what the App Engine HRD migration or backup tools do:

  1. Mark modified entities with a child entity marker
  2. Run a MapperPipeline using the App Engine mapreduce library, iterating over those markers with a Datastore Input Reader
  3. In your map function, fetch the parent entity, serialize it to Google Storage using a File Output Writer, and remove the marker
  4. Ping the remote host to import those entities from the Google Storage URL
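Steps 1–3 could be sketched roughly like this in Python. Treat it as a sketch against the App Engine mapreduce library, not a working implementation: the module paths, handler spec strings, bucket name, and output-writer parameters are assumptions you would need to adapt to your app.

```python
from google.appengine.ext import db
from mapreduce.mapper_pipeline import MapperPipeline

class SyncMarker(db.Model):
    """Step 1: child entity created (with the modified entity as its
    parent) whenever you put() an entity you want replicated."""
    pass

def mark_modified(entity):
    # Call this after every put() of an entity that should be replicated.
    SyncMarker(parent=entity.key()).put()

def export_marked(marker):
    """Step 3: map function -- fetch the parent entity, serialize it
    for the output writer, and clear the marker."""
    entity = db.get(marker.key().parent())
    if entity is not None:
        yield db.model_to_protobuf(entity).Encode()
    marker.delete()

# Step 2: a MapperPipeline iterating over the markers with the
# Datastore input reader, writing serialized entities to Google Storage.
pipeline = MapperPipeline(
    "export-modified-entities",
    handler_spec="myapp.sync.export_marked",  # assumed module path
    input_reader_spec="mapreduce.input_readers.DatastoreInputReader",
    output_writer_spec="mapreduce.output_writers.FileOutputWriter",
    params={
        "entity_kind": "myapp.sync.SyncMarker",
        "filesystem": "gs",
        "gs_bucket_name": "my-sync-bucket",  # assumed bucket
    },
)
```

Because the markers are deleted as they are exported, each run only touches entities modified since the previous sync, which matches the once-or-twice-a-day requirement.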

As an alternative to 3 and 4, you could make multiple urlfetch(POST) calls to send each serialized entity to the remote host directly, but it is more fragile, as a single failure could compromise the integrity of your data import.
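The direct-send variant might look like this inside the map function. The endpoint URL is hypothetical, and the raised exception illustrates the caveat above: one failed POST can leave the replica's import incomplete.

```python
from google.appengine.api import urlfetch
from google.appengine.ext import db

IMPORT_URL = "https://replica.example.com/import"  # hypothetical Amazon endpoint

def send_entity(entity):
    """Serialize one entity and POST it straight to the remote host."""
    payload = db.model_to_protobuf(entity).Encode()
    result = urlfetch.fetch(IMPORT_URL, payload=payload,
                            method=urlfetch.POST, deadline=30)
    if result.status_code != 200:
        # A single failure here can leave the replica inconsistent,
        # which is why the Google Storage route is more robust.
        raise Exception("import failed: %d" % result.status_code)
```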

You could look at the datastore admin source code for inspiration.

Upvotes: 5
