FindBoat

Reputation: 591

Suggested way to process Google Cloud SQL data

The task I need to run is to process all the data in my Google Cloud SQL database and copy files from the blobstore to Google Cloud Storage. I'm considering using the task queue, but it has a 10-minute time limit and my task will take longer than that. I'm not sure whether mapreduce can be used, since I don't need to run things in parallel, and I haven't found any mapreduce support for Cloud SQL. So what's the suggested way to do this kind of data processing on App Engine, especially for Cloud SQL?

Upvotes: 0

Views: 222

Answers (2)

Bugs

Reputation: 1452

Option 1

Have each MySQL data row processed by a separate task. This way, you won't need to worry about the processing time, because you'll have 10 minutes for each row.

This should be pretty straightforward if you have sequential IDs in your tables. Otherwise, you may just create them by adding an auto_increment column.

Your task could look something like this (pseudocode):

function handleTask(int rowId)

    # id >= rowId to cater to deleted rows
    row = query('SELECT ... WHERE id >= rowId ORDER BY id LIMIT 1')

    # No matching row means we have reached the end of the table
    if row is null
        return

    # Do whatever you need with the row
    process(row)

    # Use the ID of the current row to create a task for the next one;
    # enqueueTask(n) schedules a task that calls handleTask(n)
    enqueueTask(row.id + 1)

enqueueTask(1)

This is actually similar to the way the mapreduce implementation provided by Google works, except it is not sharded.
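
Concretely, on App Engine Python the chain could be implemented along these lines. This is only a sketch: the /process-row URL, the items table, the Cloud SQL instance name, and the process() function are placeholder assumptions, with MySQLdb used to reach Cloud SQL as in the official docs.

import MySQLdb
import webapp2
from google.appengine.api import taskqueue

def enqueue_task(row_id):
    # Creates a push task that will POST to ProcessRowHandler below
    taskqueue.add(url='/process-row', params={'row_id': row_id})

class ProcessRowHandler(webapp2.RequestHandler):
    def post(self):
        row_id = int(self.request.get('row_id'))
        db = MySQLdb.connect(unix_socket='/cloudsql/your-project:your-instance',
                             db='mydb', user='root')
        cursor = db.cursor()
        # id >= row_id caters to deleted rows
        cursor.execute('SELECT id FROM items WHERE id >= %s ORDER BY id LIMIT 1',
                       (row_id,))
        row = cursor.fetchone()
        if row is None:
            return  # past the last row, so the chain ends here
        process(row)  # placeholder for your per-row work
        enqueue_task(row[0] + 1)  # chain a task for the next row

app = webapp2.WSGIApplication([('/process-row', ProcessRowHandler)])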

It would be good to enqueue the task for the next row before processing the current one, to add parallelism, but then you would need to cater to the situation where a task fails in the middle and is restarted, which App Engine's task queues do automatically; the restarted task would enqueue its successor a second time. A good solution may be to only enqueue the processing while scanning the tables, instead of doing it inline, i.e. enqueueProcessing(row) instead of process(row). That way you'll quickly create tasks for all the rows, and they will be able to run in parallel; see the sketch below.
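
A sketch of that scan-then-process variant, under the same assumptions and imports as above; the /scan and /process-row URLs are again placeholders, and get_cursor() stands in for however you obtain a Cloud SQL cursor:

class ScanHandler(webapp2.RequestHandler):
    def post(self):
        row_id = int(self.request.get('row_id'))
        cursor = get_cursor()  # hypothetical helper returning a Cloud SQL cursor
        cursor.execute('SELECT id FROM items WHERE id >= %s ORDER BY id LIMIT 1',
                       (row_id,))
        row = cursor.fetchone()
        if row is None:
            return
        # Enqueue the actual processing instead of doing it inline...
        taskqueue.add(url='/process-row', params={'row_id': row[0]})
        # ...and immediately chain the scan to the next row, so processing
        # tasks pile up quickly and run in parallel
        taskqueue.add(url='/scan', params={'row_id': row[0] + 1})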

Option 2

Use a B instance (formerly known as backends). You can deploy your worker as a separate module on a B instance. Requests to B instances can run indefinitely. However, you will still need to cater to unexpected halts: your worker should be able to save state and pick up where it left off in case the instance gets shut down, as in the checkpointing sketch below.
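
For example, the worker could persist its progress in the datastore and resume from there on startup. A minimal sketch, assuming ndb; fetch_row() and process() are hypothetical stand-ins for your row lookup and per-row work:

from google.appengine.ext import ndb

class Checkpoint(ndb.Model):
    last_id = ndb.IntegerProperty(default=0)

def run_worker():
    checkpoint = Checkpoint.get_or_insert('worker')
    row_id = checkpoint.last_id + 1
    while True:
        row = fetch_row(row_id)  # hypothetical: first row with id >= row_id
        if row is None:
            break  # done with the whole table
        process(row)  # placeholder for your per-row work
        # Persist progress so a restarted instance picks up where it left off
        checkpoint.last_id = row.id
        checkpoint.put()
        row_id = row.id + 1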

Option 3

Load your data into the datastore and then use the provided mapreduce implementation.

This is likely a very slow and expensive solution, but you didn't say what you want to do with your data, so I'll explain it in case migrating off Google Cloud SQL is actually what you're after.

From MySQL, you can export your data as CSV onto your computer: How to output MySQL query results in csv format?
You can then upload that CSV into the datastore: Uploading the data to App Engine
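
If you prefer to script the export rather than use the MySQL client, here is a minimal sketch with Python's csv module; the table, columns, and credentials are placeholders:

import csv
import MySQLdb

db = MySQLdb.connect(host='localhost', user='root', passwd='your-password',
                     db='mydb')
cursor = db.cursor()
cursor.execute('SELECT id, blob_key, filename FROM files')

with open('files.csv', 'wb') as f:  # 'wb' because this targets Python 2
    writer = csv.writer(f)
    writer.writerow(['id', 'blob_key', 'filename'])  # header row for the loader
    writer.writerows(cursor.fetchall())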

All links are to Python docs, because you didn't say which language you're using. The principles will hold, though.

Upvotes: 1

Razvan Musaloiu-E.

Reputation: 1320

You can access Cloud SQL over IP from anywhere, including Google Compute Engine. That's the best way to avoid the time constraints of App Engine while still staying close to the Google servers.
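
A minimal sketch of such a connection, assuming MySQLdb on the Compute Engine VM; the IP, credentials, and database name are placeholders, and the VM's IP must first be authorized on the Cloud SQL instance:

import MySQLdb

# 203.0.113.5 is a documentation placeholder; use your instance's assigned IP
db = MySQLdb.connect(host='203.0.113.5', user='root',
                     passwd='your-password', db='mydb')
cursor = db.cursor()
cursor.execute('SELECT COUNT(*) FROM items')
print cursor.fetchone()[0]  # Python 2 syntax, as on App Engine of that era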

Upvotes: 0
