Reputation: 43
I'm really new to this whole data engineering field, and I'm taking this on as my thesis project, so bear with me.
I'm currently developing a big data platform for a battery storage system that already has a Cloud SQL service collecting data every 15 seconds (so it is already in the GCP environment). My job is to COPY THE DATA AND TRANSFER IT TO BIGQUERY EACH TIME NEW DATA COMES IN (including preparing the data using Dataprep), which will then be used for machine learning.
I have dug up several approaches. One of them uses Dataflow; I tried it once with a JDBC to BigQuery job, but it was done manually. To fulfill my needs (running the jobs regularly), I was recommended Cloud Composer.
On the other hand, I found another source that uses Pub/Sub to trigger Dataflow jobs. The latter approach seems more promising, but still, it's better to know both worlds. Any suggestion will definitely help...
Upvotes: 0
Views: 349
Reputation: 75715
To be more efficient, I suggest you avoid Cloud Composer and Dataflow. You can use federated queries to query Cloud SQL directly from BigQuery (if you use the MySQL or PostgreSQL engine).
So, perform your transform and load all in one request:
INSERT INTO <BQ table>
SELECT <your transforms/projections>
FROM EXTERNAL_QUERY("<connection_id>", "<SELECT your most recent data>");
Need to schedule it? Use scheduled queries in BigQuery.
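If you would rather create the schedule programmatically than through the console, here is a minimal sketch using the BigQuery Data Transfer Python client (scheduled queries are backed by the Data Transfer Service). The project, dataset, connection and table names, and the query itself, are placeholders for illustration; also note that scheduled queries cannot run more often than every 15 minutes.

from google.cloud import bigquery_datatransfer

# Placeholders -- replace with your own project, connection and tables.
project_id = "your-project"
query = """
INSERT INTO `your-project.your_dataset.battery_data`
SELECT *
FROM EXTERNAL_QUERY(
  'your-project.us.your-cloudsql-connection',
  'SELECT * FROM measurements WHERE recorded_at > NOW() - INTERVAL 15 MINUTE');
"""

client = bigquery_datatransfer.DataTransferServiceClient()

transfer_config = bigquery_datatransfer.TransferConfig(
    display_name="cloudsql-to-bigquery",
    data_source_id="scheduled_query",  # built-in data source for scheduled queries
    params={"query": query},
    schedule="every 15 minutes",       # minimum interval for scheduled queries
)

transfer_config = client.create_transfer_config(
    parent=client.common_project_path(project_id),
    transfer_config=transfer_config,
)
print("Created scheduled query:", transfer_config.name)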
Upvotes: 0
Reputation: 4130
You can set up an Airflow pipeline (using Cloud Composer) with a scheduler, which is much easier and more straightforward than Dataflow. The Airflow GUI has rich features for monitoring status and scheduling. There are built-in Python operators to connect to AI Platform, BigQuery, Cloud SQL and a lot of other services from the Airflow instance.
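As a rough illustration (not a full pipeline), here is a minimal DAG sketch that runs the federated INSERT ... SELECT from the other answer every 15 minutes with the BigQueryInsertJobOperator; the project, connection and table names are placeholders, and it assumes a recent apache-airflow-providers-google package. You could equally swap in the Cloud SQL or JDBC operators if you want to pull the rows yourself.

from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

# Placeholder query: the same federated INSERT ... SELECT idea as in the other answer.
SQL = """
INSERT INTO `your-project.your_dataset.battery_data`
SELECT *
FROM EXTERNAL_QUERY(
  'your-project.us.your-cloudsql-connection',
  'SELECT * FROM measurements WHERE recorded_at > NOW() - INTERVAL 15 MINUTE');
"""

with DAG(
    dag_id="cloudsql_to_bigquery",
    start_date=datetime(2021, 1, 1),
    schedule_interval="*/15 * * * *",  # every 15 minutes
    catchup=False,
) as dag:
    load_to_bq = BigQueryInsertJobOperator(
        task_id="load_to_bq",
        configuration={"query": {"query": SQL, "useLegacySql": False}},
    )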
Another approach is using Cloud Scheduler with Pub/Sub and Cloud Functions; a rough sketch of that pattern follows below. You can check this answer to a similar kind of use case:
How to start AI-Platform jobs automatically?
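As a sketch of that pattern: Cloud Scheduler publishes to a Pub/Sub topic on a cron schedule, and a Cloud Function subscribed to that topic launches the Google-provided JDBC to BigQuery Dataflow template (the one the question already tried manually). The project, bucket, connection string and query below are placeholders, and the template parameter names should be double-checked against the current template documentation.

import base64

from googleapiclient.discovery import build

PROJECT = "your-project"
REGION = "us-central1"
TEMPLATE = "gs://dataflow-templates/latest/Jdbc_to_BigQuery"  # Google-provided template

def launch_jdbc_to_bq(event, context):
    """Background Cloud Function triggered by a Pub/Sub message from Cloud Scheduler."""
    trigger = base64.b64decode(event["data"]).decode("utf-8") if event.get("data") else ""
    print("Triggered by message:", trigger)

    dataflow = build("dataflow", "v1b3", cache_discovery=False)
    request = dataflow.projects().locations().templates().launch(
        projectId=PROJECT,
        location=REGION,
        gcsPath=TEMPLATE,
        body={
            "jobName": "cloudsql-to-bigquery",
            "parameters": {
                # Parameter names follow the JDBC to BigQuery template; verify before use.
                "driverClassName": "com.mysql.jdbc.Driver",
                "driverJars": "gs://your-bucket/jars/mysql-connector-java.jar",
                "connectionURL": "jdbc:mysql://<cloud-sql-ip>:3306/your_db",
                "query": "SELECT * FROM measurements WHERE recorded_at > NOW() - INTERVAL 15 MINUTE",
                "outputTable": "your-project:your_dataset.battery_data",
                "bigQueryLoadingTemporaryDirectory": "gs://your-bucket/tmp",
            },
        },
    )
    print(request.execute())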
Upvotes: 1