Mr F.

Reputation: 43

Automating import of Google Cloud Storage file to Google Cloud MySQL table

Currently, I have a CSV in a GCS bucket that gets updated on a daily basis by a pipeline hosted on Google Cloud. I've used this CSV to populate a table in a MySQL database via the Cloud UI Import feature without issue.

What I'd like to do is have this import run every day at a set time to update the table data since my pipeline will be updating my CSV. I don't see any UI features to do this and am not sure what the best way to proceed would be otherwise. I'm much more of a data analyst than a SWE, so trying to decipher functionality like Cloud Run is a bit tricky...

Upvotes: 0

Views: 1798

Answers (3)

Mr F.

Reputation: 43

Update: I wound up doing a few things to solve my issue. TL;DR: BigQuery + an event-triggered function:

  1. I ditched my MySQL instance in favor of a BigQuery table. I considered using a federated table (an external data source, for those new to this like me), but given the slow read times and my need to explore the data regularly, I decided it was too clunky for me.

  2. I created a function that triggers when files are added or overwritten in my GCS bucket. This function runs a CSV import into my table, and with a few modifications it lets me truncate (overwrite) the data, which suits how my pipeline is set up. Credit to guillaume blaquiere and Mehmet Karakose for suggesting this path, but I was not skilled enough to write and set up the function on my own.

  3. To achieve #2, I used this very handy guide from rickt.org: https://rickt.org/2018/10/22/poc-automated-insert-of-csv-data-into-bigquery-via-gcs-bucket-python/. This gave a step-by-step process for writing and setting up the function to trigger on GCS bucket changes.

  4. For those wondering about my modifications (a rough sketch of the resulting function follows this list):

    a. I added a job_config.write_disposition line set to WRITE_TRUNCATE for overwriting

    b. I also changed the uri value to only include one CSV that I specified instead of all files in the bucket.
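For reference, here is a minimal sketch of what such a GCS-triggered function can look like (assuming a 1st-gen Python Cloud Function and the google-cloud-bigquery client; this is a simplified version, not my exact code, and the bucket, file, and table names are placeholders):

```python
# Sketch of a Cloud Function triggered by object changes in a GCS bucket.
# It loads one specific CSV into a BigQuery table, overwriting existing rows.
from google.cloud import bigquery

TARGET_FILE = "daily_export.csv"             # placeholder: the one CSV to import
TABLE_ID = "my-project.my_dataset.my_table"  # placeholder: destination table

def gcs_to_bq(event, context):
    # 'event' describes the GCS object that was created or overwritten.
    if event["name"] != TARGET_FILE:
        return  # ignore other files landing in the bucket

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,   # assumes the CSV has a header row
        autodetect=True,       # or supply an explicit schema instead
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,  # modification (a)
    )

    # Modification (b): point the uri at the single CSV instead of the whole bucket.
    uri = f"gs://{event['bucket']}/{event['name']}"
    load_job = client.load_table_from_uri(uri, TABLE_ID, job_config=job_config)
    load_job.result()  # wait for the load job; raises if it fails
```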

There was a small permissions issue to sort out so that my file/bucket was accessible to the service account associated with the function, but nothing too dramatic. Testing the function showed success, with my BigQuery table being updated as expected. I will monitor it over the next few nights to make sure this continues to work, and will edit this answer if it breaks for any unforeseen reason.

Upvotes: 1

guillaume blaquiere

Reputation: 76000

If you have a CSV and you want to visualize the data in Data Studio, I recommend using BigQuery. It's a petabyte-scale data warehouse with tons of features!

One of them is the ability to read data directly from Cloud Storage and query it: federated queries. It's not the most efficient option, but it's the most useful for you.
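For example, with the google-cloud-bigquery Python client you can define the CSV as a temporary external table and query it directly. A minimal sketch (the bucket, file, and temporary table names are placeholders):

```python
# Sketch: federated query over a CSV in Cloud Storage via a temporary
# external table definition. No data is loaded into BigQuery storage.
from google.cloud import bigquery

client = bigquery.Client()

external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://my-bucket/daily_export.csv"]  # placeholder
external_config.autodetect = True
external_config.options.skip_leading_rows = 1  # assumes a header row

# Make the CSV queryable as "csv_data" for the duration of this query only.
job_config = bigquery.QueryJobConfig(table_definitions={"csv_data": external_config})

for row in client.query("SELECT * FROM csv_data LIMIT 10", job_config=job_config):
    print(dict(row))
```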

You can also read the file and store the result in a table with an INSERT ... SELECT ... statement.
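A hedged sketch of that second option, assuming a permanent external table `my_dataset.csv_external` has already been defined over the CSV and a native table `my_dataset.daily_data` exists (both names are placeholders):

```python
# Sketch: materialize the GCS-backed external table into a native BigQuery
# table with a single INSERT ... SELECT statement.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
INSERT INTO `my_dataset.daily_data`
SELECT * FROM `my_dataset.csv_external`
"""

client.query(sql).result()  # waits for the DML job to complete
```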

Finally, you can load the CSV file into BigQuery. It's not really hard either, but it needs more code (catch the event when a file is created in Cloud Storage, then load it into BigQuery). For a simple piece of code like that, I'm sure you can find examples with Cloud Functions and Cloud Run.


So, once the data are in BigQuery, you can simply query them with SQL, as you would with MySQL. BigQuery is a serverless product: you pay only for the volume of data stored (in BigQuery, or in Cloud Storage if the data stay there) and for the volume of data that you process.

There are also a lot of great articles explaining how to limit costs in BigQuery with partitioning and clustering; I also wrote an article where I talk about quotas.

Upvotes: 0

Mehmet Karakose

Reputation: 76

Currently, GCP provides default pipelines for its products (BigQuery, Cloud Spanner, Datastore, etc.) through Dataflow templates. Dataflow does not offer a template from text files on Cloud Storage to Cloud SQL (MySQL, etc.).

I think you can write a function: when a file is added to Cloud Storage, it reads the file and writes the relevant parts to the table in MySQL.

Or

You can automate the import to MySQL using the gcloud command-line tool or the REST API (beta). For a better understanding, read this doc.
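As an illustration only (the project, instance, bucket, database, and table names are placeholders, and it assumes the google-api-python-client package with application default credentials), a scripted import via the Cloud SQL Admin API could be sketched like this:

```python
# Sketch: kick off a Cloud SQL CSV import through the Cloud SQL Admin API,
# the programmatic equivalent of the Import button in the UI.
from googleapiclient import discovery

def import_csv_to_cloudsql():
    service = discovery.build("sqladmin", "v1beta4")  # uses default credentials
    body = {
        "importContext": {
            "fileType": "CSV",
            "uri": "gs://my-bucket/daily_export.csv",   # placeholder CSV
            "database": "my_database",                  # placeholder database
            "csvImportOptions": {"table": "my_table"},  # placeholder table
        }
    }
    # instances().import_ maps to the REST instances.import method.
    operation = service.instances().import_(
        project="my-project", instance="my-instance", body=body
    ).execute()
    print(operation["name"])  # operation id, can be polled for completion
```

The same import can be done as a one-off from the gcloud CLI with `gcloud sql import csv INSTANCE_NAME gs://my-bucket/daily_export.csv --database=my_database --table=my_table`, which could then be wrapped in a scheduled job.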

Upvotes: 0
