IoT user

Reputation: 1300

Cloud SQL: export data to CSV periodically avoiding duplicates

I want to export the data from Cloud SQL (Postgres) to a CSV file periodically (once a day, for example), and rows that have already been exported must not be exported again in the next export task.

I'm currently using a POST request to perform the export task via Cloud Scheduler. The problem here (or at least as far as I know) is that it won't be able to export and delete (or update the rows to mark them as exported) in a single HTTP export request.

Is there any way to automatically delete (or update) the rows which have been exported, with some Cloud SQL parameter in the HTTP export request?

If not, I assume it should be done in a Cloud Function triggered by Pub/Sub (using Cloud Scheduler to send a message once a day to Pub/Sub), but is there an optimal way to take all the IDs of the rows retrieved by the SELECT statement (which will be used in the export) so they can be deleted (or updated) later?

Upvotes: 0

Views: 956

Answers (4)

IoT user

Reputation: 1300

Thank you for all your answers. There are multiple ways of doing this, so I'm going to explain how I did it.

In the database I have included a column which contains the date when the data was inserted.
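For reference, a minimal sketch of how such a column could be added, assuming the table and column names that appear in the request body below (adjust to your schema):

-- Adds a date column that records when each row was inserted
ALTER TABLE public."tablename"
    ADD COLUMN "Insertion_Date" date NOT NULL DEFAULT CURRENT_DATE;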

I used a Cloud Scheduler job with the following body:

{"exportContext":{"fileType": "CSV", "csvExportOptions" :{"selectQuery" : "select \"column1\", \"column2\",... , \"column n\" from public.\"tablename\" where \"Insertion_Date\" = CURRENT_DATE - 1" },"uri": "gs://bucket/filename.csv","databases": ["postgres"]}}

This scheduler job is triggered once a day and exports only the data from the previous day.
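If you want to check beforehand what a given run will pick up, the same filter can be run manually in psql (a sketch using the same table and column names as above):

-- Same predicate as the selectQuery above: yesterday's rows
SELECT count(*)
FROM public."tablename"
WHERE "Insertion_Date" = CURRENT_DATE - 1;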

Also, note that in the query used in Cloud Scheduler you can choose which columns to export; this way you can avoid exporting the Insertion_Date column and use it only as an auxiliary.

Finally, the Cloud Scheduler job will automatically create the CSV file in the bucket.

Upvotes: 1

ZUKINI

Reputation: 195

Another method, aside from @jjanes's, would be to partition your database by date. This would allow you to create an index on the date, making it very easy to export or delete a day's entries; a sketch of such a setup follows the quoted documentation below. With this implementation, you could also create a cron job that deletes all tables older than X days.

The documentation provided will walk you through setting up a range partition:

The table is partitioned into “ranges” defined by a key column or set of columns, with no overlap between the ranges of values assigned to different partitions. For example, one might partition by date ranges, or by ranges of identifiers for particular business objects.
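A minimal sketch of such a setup, with illustrative table and column names (not taken from the question):

-- Parent table partitioned by the insertion date
CREATE TABLE public.measurements (
    device_id      bigint,
    payload        jsonb,
    insertion_date date NOT NULL DEFAULT CURRENT_DATE
) PARTITION BY RANGE (insertion_date);

-- One partition per day; a day's data can then be exported or dropped as a unit
CREATE TABLE public.measurements_2020_03_01
    PARTITION OF public.measurements
    FOR VALUES FROM ('2020-03-01') TO ('2020-03-02');

-- Removing a day that has already been exported
DROP TABLE public.measurements_2020_03_01;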

Upvotes: 1

Enrique Del Valle

Reputation: 520

You could use a setup like the following to achieve what you are looking for:

1. Create a Cloud Function that extracts the information from the database and subscribes to a Pub/Sub topic.
2. Create a Pub/Sub topic to trigger that function.
3. Create a Cloud Scheduler job that invokes the Pub/Sub trigger.
4. Run the Cloud Scheduler job.
5. Then create a trigger which activates another Cloud Function to delete all the required data from the database once the CSV has been created (a sketch of that query follows this list).
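As a sketch of what the cleanup function in step 5 could run, assuming the rows carry an insertion-date column like the one used in the export body earlier in this thread (names are illustrative):

-- Run by the second Cloud Function once the CSV is in the bucket;
-- mirrors the date filter used in the export query
DELETE FROM public."tablename"
WHERE "Insertion_Date" = CURRENT_DATE - 1;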

Here are some documents which could help you if you decide to follow this path.

Using Pub/Sub to trigger a Cloud Function: https://cloud.google.com/scheduler/docs/tut-pub-sub

Connecting to Cloud SQL from Cloud Functions: https://cloud.google.com/sql/docs/mysql/connect-functions

Cloud Storage Tutorial: https://cloud.google.com/functions/docs/tutorials/storage

Upvotes: 1

jjanes

Reputation: 44363

You can export and delete (or update) at the same time using RETURNING.

\copy (DELETE FROM pgbench_accounts WHERE aid<1000 RETURNING *) to foo.txt
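The update variant works the same way if you prefer to mark rows as exported instead of deleting them; the exported column here is a hypothetical boolean flag, not part of pgbench's schema:

-- "exported" is a hypothetical flag column (boolean NOT NULL DEFAULT false)
\copy (UPDATE pgbench_accounts SET exported = true WHERE NOT exported RETURNING *) to foo.txt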

The problem would be in the face of crashes. How can you know that foo.txt has been written and flushed to disk before the DELETE is allowed to commit? Or the reverse: foo.txt is partially (or fully) written, but a crash prevents the DELETE from committing.

Can't you make the system idempotent, so that exporting the same row more than once doesn't create problems?
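For example, if the exported CSV is eventually loaded into another Postgres table keyed on aid, re-importing the same rows is harmless (a sketch; both table names are assumptions, not part of the original setup):

-- accounts_archive has a primary key on aid; staging_accounts holds the freshly loaded CSV
INSERT INTO accounts_archive
SELECT * FROM staging_accounts
ON CONFLICT (aid) DO NOTHING;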

Upvotes: 1
