CarlRoy

Reputation: 11

GCP Data Ingestion Architecture

I am going to start working on GCP data ingestion to BigQuery from CSVs and data lakes, and I am looking for your advice on the technologies or architecture I can use.

I am new to GCP, but I have a good understanding of data engineering technologies, as I am preparing for the certification.

I tried a simple load into BigQuery using Dataflow, converted it to a pipeline, and then used a scheduler to run the pipeline at a specific time. The problem I have is: how can I move the CSV files that have been processed to another bucket? Can I call functions to move the files?

Any advice or documentation you have is highly appreciated.

Thank you in advance.

Regards, CarlRoy

Upvotes: 1

Views: 502

Answers (1)

Mazlum Tosun

Reputation: 6572

The question is general, but I will try to help you with a few explanations.

There are many ways to build a data ingestion architecture.

- Solution 1 / DAG orchestration with Airflow and Cloud Composer

You can use a data pipeline orchestrator like Airflow with Cloud Composer on Google Cloud. Airflow uses a DAG system, and Cloud Composer runs it on a GKE cluster. Airflow is responsible for chaining all the tasks of your pipeline.

ETL example (Extract, Transform, Load):

check_csv_file_exist_input_bucket >> 
run_dataflow_job >> 
move_treated_files_to_other_bucket

Cloud Composer doc

Airflow doc
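
If it helps, here is a minimal Airflow DAG sketch of this ETL chain. The bucket names, Dataflow template path, region and schedule are only assumptions for the example, not values from your setup; the GCSToGCSOperator with move_object=True is one way to answer your question about moving the processed CSV files to another bucket.

from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.sensors.gcs import GCSObjectsWithPrefixExistenceSensor
from airflow.providers.google.cloud.operators.dataflow import DataflowTemplatedJobStartOperator
from airflow.providers.google.cloud.transfers.gcs_to_gcs import GCSToGCSOperator

with DAG(
    dag_id="csv_to_bigquery_etl",
    schedule_interval="0 6 * * *",      # hypothetical daily schedule
    start_date=datetime(2023, 1, 1),
    catchup=False,
) as dag:
    # 1. Wait until at least one CSV file is present in the input bucket
    check_csv_file_exist_input_bucket = GCSObjectsWithPrefixExistenceSensor(
        task_id="check_csv_file_exist_input_bucket",
        bucket="my-input-bucket",       # hypothetical bucket name
        prefix="input/",
    )

    # 2. Run the Dataflow job (here launched from a template; a Beam pipeline can also be run directly)
    run_dataflow_job = DataflowTemplatedJobStartOperator(
        task_id="run_dataflow_job",
        template="gs://my-templates-bucket/csv_to_bq_template",   # hypothetical template path
        location="europe-west1",
        parameters={"input": "gs://my-input-bucket/input/*.csv"},
    )

    # 3. Move the processed files to another bucket (move_object=True deletes the source files)
    move_treated_files_to_other_bucket = GCSToGCSOperator(
        task_id="move_treated_files_to_other_bucket",
        source_bucket="my-input-bucket",
        source_object="input/*.csv",
        destination_bucket="my-archive-bucket",                   # hypothetical archive bucket
        destination_object="processed/",
        move_object=True,
    )

    check_csv_file_exist_input_bucket >> run_dataflow_job >> move_treated_files_to_other_bucket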

Your Dataflow job:

  • Reads the CSV files from the input bucket
  • Applies business rules, aggregations and transformations
  • Writes the results to BigQuery

The Dataflow runner is serverless and is based on the Apache Beam model. Dataflow uses autoscaling, and Beam proposes a unified model for batch and streaming. It can be used for small, medium and large files, and also for long-running jobs (more than one hour). Beam offers Java, Python and Go SDKs.

Dataflow doc

Beam doc

In this example, the transformations are applied in the Dataflow job, then the result is saved to BigQuery.
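
Here is a minimal Apache Beam sketch of such a Dataflow job, assuming a simple two-column CSV and hypothetical project, bucket and table names:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def parse_csv_line(line: str) -> dict:
    # Turn one CSV line into a dict matching the BigQuery schema (hypothetical columns)
    name, amount = line.split(",")
    return {"name": name, "amount": float(amount)}


def run():
    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-gcp-project",                # hypothetical project
        region="europe-west1",
        temp_location="gs://my-temp-bucket/tmp",
    )
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "ReadCsv" >> beam.io.ReadFromText("gs://my-input-bucket/input/*.csv", skip_header_lines=1)
            | "ParseLines" >> beam.Map(parse_csv_line)
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                "my-gcp-project:my_dataset.my_table",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
            )
        )


if __name__ == "__main__":
    run()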

ELT example (Extract, Load, Transform):

load_csv_inputs_file_bucket_to_staging_table_bigquery >> 
run_task_sql_transformation >> 
run_task_sql_insertion_final_table >> 
move_treated_files_to_other_bucket

In this example, the files are first loaded into a staging table in BigQuery. Then the transformations are applied with SQL. At the end, the results are written to a final table in BigQuery.
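
As a sketch, the load and SQL steps of this ELT chain could look like this with the Google providers for Airflow (project, dataset, bucket and column names are assumptions; the final "move files" task is the same GCSToGCSOperator as in the ETL sketch above):

from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

with DAG(
    dag_id="csv_to_bigquery_elt",
    schedule_interval="0 6 * * *",
    start_date=datetime(2023, 1, 1),
    catchup=False,
) as dag:
    # 1. Load the raw CSV files into a BigQuery staging table
    load_csv_inputs_file_bucket_to_staging_table_bigquery = GCSToBigQueryOperator(
        task_id="load_csv_inputs_file_bucket_to_staging_table_bigquery",
        bucket="my-input-bucket",
        source_objects=["input/*.csv"],
        destination_project_dataset_table="my-gcp-project.my_dataset.staging_table",
        source_format="CSV",
        skip_leading_rows=1,
        autodetect=True,
        write_disposition="WRITE_TRUNCATE",
    )

    # 2. Apply the transformations in SQL and insert the result into the final table
    run_task_sql_insertion_final_table = BigQueryInsertJobOperator(
        task_id="run_task_sql_insertion_final_table",
        configuration={
            "query": {
                "query": """
                    INSERT INTO `my-gcp-project.my_dataset.final_table`
                    SELECT name, SUM(amount) AS total_amount
                    FROM `my-gcp-project.my_dataset.staging_table`
                    GROUP BY name
                """,
                "useLegacySql": False,
            }
        },
    )

    load_csv_inputs_file_bucket_to_staging_table_bigquery >> run_task_sql_insertion_final_table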

- Solution 2 / DAG orchestration with Cloud Workflows

Instead of Airflow and Cloud Composer, you can use Cloud Workflows. Cloud Workflows is a pipeline orchestrator like Airflow, but the service is serverless.

Cloud Workflows doc

All the previous examples presented in the Airflow part are also valid with Cloud Workflows.

Sometimes, if your files are not big and you don't have long-running jobs (less than one hour), you can also think of using Cloud Functions (2nd gen) or Cloud Run.
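
For example, a small Cloud Functions (2nd gen) sketch triggered by a new file in the bucket could both load the CSV into BigQuery and move it afterwards. The project, dataset, table and archive bucket names here are assumptions:

import functions_framework
from google.cloud import bigquery, storage


@functions_framework.cloud_event
def load_csv_to_bigquery(cloud_event):
    # The Cloud Storage "object finalized" event carries the bucket and file names
    data = cloud_event.data
    bucket_name = data["bucket"]
    file_name = data["name"]

    # Load the CSV file into BigQuery
    bq_client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    load_job = bq_client.load_table_from_uri(
        f"gs://{bucket_name}/{file_name}",
        "my-gcp-project.my_dataset.my_table",    # hypothetical destination table
        job_config=job_config,
    )
    load_job.result()  # wait for the load job to finish

    # Move the processed file to an archive bucket (copy, then delete the source)
    storage_client = storage.Client()
    source_bucket = storage_client.bucket(bucket_name)
    blob = source_bucket.blob(file_name)
    destination_bucket = storage_client.bucket("my-archive-bucket")   # hypothetical archive bucket
    source_bucket.copy_blob(blob, destination_bucket, file_name)
    blob.delete()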

All the technologies have pros and cons and you have to choose the services that best fit your use cases.

I'll share an article I wrote showing a use case with Dataflow, Airflow and BigQuery:

Remove duplicates in Bigquery batch pipeline with Airflow and Dataflow

Upvotes: 1
