Reputation: 11
I am going to start working on GCP data ingestion into BigQuery from CSV files and data lakes, and I am looking for advice on which technologies or architecture I can use.
I am new to GCP, but I have a good understanding of data engineering technologies as I am preparing for the certification.
I tried a simple load into BigQuery using Dataflow, converted it to a pipeline, then used a scheduler to run the pipeline at a specific time. The problem I have is: how can I move the CSV files that have been processed to another bucket? Can I call functions to move the files?
Any advice or documentation you have is highly appreciated.
Thank you in advance.
Regards, CarlRoy
Upvotes: 1
Views: 502
Reputation: 6572
The question is general, but I will try to help you with these few explanations.
There are many ways to build a data ingestion architecture.
- Solution 1 / DAG orchestration with Airflow and Cloud Composer

You can use a data pipeline orchestrator like Airflow with Cloud Composer in Google Cloud. Airflow uses a DAG system and Cloud Composer runs it on a GKE cluster. Airflow has the responsibility to chain all the tasks of your pipeline.
ETL example (Extract, Transform, Load):
check_csv_file_exist_input_bucket >>
run_dataflow_job >>
move_processed_files_to_other_bucket
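A minimal sketch of this ETL chain as an Airflow DAG, assuming the Google provider package is installed and using placeholder bucket, project and template names; the GCSToGCSOperator with move_object=True is one way to move the processed CSV to another bucket:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.sensors.gcs import GCSObjectExistenceSensor
from airflow.providers.google.cloud.operators.dataflow import DataflowTemplatedJobStartOperator
from airflow.providers.google.cloud.transfers.gcs_to_gcs import GCSToGCSOperator

with DAG(
    dag_id="csv_to_bigquery_etl",
    start_date=datetime(2023, 1, 1),
    schedule_interval="0 6 * * *",  # run the pipeline at a specific time, like your scheduler setup
    catchup=False,
) as dag:

    # 1. Wait for the CSV file to land in the input bucket (names are placeholders)
    check_csv_file_exist_input_bucket = GCSObjectExistenceSensor(
        task_id="check_csv_file_exist_input_bucket",
        bucket="my-input-bucket",
        object="incoming/data.csv",
    )

    # 2. Launch the Dataflow job from a template you staged beforehand (hypothetical path)
    run_dataflow_job = DataflowTemplatedJobStartOperator(
        task_id="run_dataflow_job",
        template="gs://my-bucket/templates/csv_to_bq",
        project_id="my-project",
        location="europe-west1",
        parameters={"inputFile": "gs://my-input-bucket/incoming/data.csv"},
    )

    # 3. Move the processed file to another bucket (move_object=True deletes the source copy)
    move_processed_files_to_other_bucket = GCSToGCSOperator(
        task_id="move_processed_files_to_other_bucket",
        source_bucket="my-input-bucket",
        source_object="incoming/data.csv",
        destination_bucket="my-archive-bucket",
        destination_object="processed/data.csv",
        move_object=True,
    )

    check_csv_file_exist_input_bucket >> run_dataflow_job >> move_processed_files_to_other_bucket
```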
Your Dataflow job:

The Dataflow runner is serverless and is based on the Apache Beam model. Dataflow uses autoscaling, and Beam proposes a unified model for batch and streaming. It can be used for small, medium and big files, and also for long-running jobs (more than one hour). Beam is proposed with Java, Python and Go SDKs.

In this example the transformations are applied, then the result is saved to BigQuery.
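As an illustration, a minimal Beam pipeline with the Python SDK could look like the sketch below (bucket, project, table and column names are placeholders, not your actual job):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def parse_csv_line(line):
    """Turn one CSV line into a dict matching the BigQuery schema (columns are placeholders)."""
    fields = line.split(",")
    return {"id": int(fields[0]), "name": fields[1], "amount": float(fields[2])}


def run():
    # On Dataflow you would pass --runner=DataflowRunner, --project, --region, --temp_location, etc.
    options = PipelineOptions()
    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadCSV" >> beam.io.ReadFromText(
                "gs://my-input-bucket/incoming/data.csv", skip_header_lines=1
            )
            | "ParseLines" >> beam.Map(parse_csv_line)
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                "my-project:my_dataset.my_table",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
            )
        )


if __name__ == "__main__":
    run()
```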
ELT example (Extract, Load, Transform):
load_csv_inputs_file_bucket_to_staging_table_bigquery >>
run_task_sql_transformation >>
run_task_sql_insertion_final_table >>
move_processed_files_to_other_bucket
In this example, files are first loaded into a staging table in BigQuery. Then transformations are applied with SQL. At the end, the results are written to a final table in BigQuery.
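A minimal Airflow sketch of this ELT chain, assuming placeholder project, dataset and bucket names and a simple deduplication query standing in for your SQL transformation:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator
from airflow.providers.google.cloud.transfers.gcs_to_gcs import GCSToGCSOperator

with DAG(
    dag_id="csv_to_bigquery_elt",
    start_date=datetime(2023, 1, 1),
    schedule_interval="0 6 * * *",
    catchup=False,
) as dag:

    # 1. Load the raw CSV into a staging table (schema autodetected here for brevity)
    load_csv_inputs_file_bucket_to_staging_table_bigquery = GCSToBigQueryOperator(
        task_id="load_csv_inputs_file_bucket_to_staging_table_bigquery",
        bucket="my-input-bucket",
        source_objects=["incoming/data.csv"],
        destination_project_dataset_table="my-project.my_dataset.staging_table",
        source_format="CSV",
        skip_leading_rows=1,
        autodetect=True,
        write_disposition="WRITE_TRUNCATE",
    )

    # 2. Apply the SQL transformation on the staging data (query is a placeholder)
    run_task_sql_transformation = BigQueryInsertJobOperator(
        task_id="run_task_sql_transformation",
        configuration={
            "query": {
                "query": """
                    CREATE OR REPLACE TABLE `my-project.my_dataset.transformed_table` AS
                    SELECT DISTINCT * FROM `my-project.my_dataset.staging_table`
                """,
                "useLegacySql": False,
            }
        },
    )

    # 3. Insert the transformed rows into the final table
    run_task_sql_insertion_final_table = BigQueryInsertJobOperator(
        task_id="run_task_sql_insertion_final_table",
        configuration={
            "query": {
                "query": """
                    INSERT INTO `my-project.my_dataset.final_table`
                    SELECT * FROM `my-project.my_dataset.transformed_table`
                """,
                "useLegacySql": False,
            }
        },
    )

    # 4. Move the processed file to an archive bucket
    move_processed_files_to_other_bucket = GCSToGCSOperator(
        task_id="move_processed_files_to_other_bucket",
        source_bucket="my-input-bucket",
        source_object="incoming/data.csv",
        destination_bucket="my-archive-bucket",
        destination_object="processed/data.csv",
        move_object=True,
    )

    (
        load_csv_inputs_file_bucket_to_staging_table_bigquery
        >> run_task_sql_transformation
        >> run_task_sql_insertion_final_table
        >> move_processed_files_to_other_bucket
    )
```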
- Solution 2 / DAG orchestration with Cloud Workflows
Instead of Airflow and Cloud Composer, you can use Cloud Workflows. Cloud Workflows is a pipeline orchestrator like Airflow, but the service is serverless. All the previous examples presented in the Airflow part are also valid with Cloud Workflows.
Sometimes, if your files are not big and you don't have long-running jobs (less than one hour), you can also think of using Cloud Functions (2nd gen) or Cloud Run.
All the technologies have pros and cons and you have to choose the services that best fit your use cases.
I share with you an article that I wrote, showing a use case with Dataflow, Airflow and BigQuery:
Remove duplicates in Bigquery batch pipeline with Airflow and Dataflow
Upvotes: 1