vignesh

Reputation: 1454

Best way to ingest data into BigQuery

I have heterogeneous sources: flat files residing on-prem, JSON on SharePoint, APIs that serve data, and so on. Which is the best ETL tool to bring this data into a BigQuery environment?

I'm a kindergarten student in GCP :)

Thanks in advance

Upvotes: 1

Views: 1390

Answers (3)

Alfredo Prada Giorgi

Reputation: 54

If you are starting from scratch and have no legacy tools to carry with you, the following GCP managed products target your use case:

  • Cloud Data Fusion, "a fully managed, code-free data integration service that helps users efficiently build and manage ETL/ELT data pipelines"

  • Cloud Composer, "a fully managed data workflow orchestration service that empowers you to author, schedule, and monitor pipelines"

  • Dataflow, "a fully managed streaming analytics service that minimizes latency, processing time, and cost through autoscaling and batch processing"

(Without considering a myriad of data integration tools and fully customized solutions using Cloud Run, Scheduler, Workflows, VMs, etc.)

Choosing one depends on your technical skills, real-time processing needs, and budget. As mentioned by Guillaume Blaquiere, if BigQuery is your only destination, you should try to leverage BigQuery's own processing power for your data transformations.
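
To make the Cloud Composer option a bit more concrete, here is a minimal Airflow DAG sketch that loads a file from Cloud Storage into BigQuery. The bucket, object pattern, and table names are placeholders I made up for illustration, not something from the original answer:

    # Hypothetical DAG: bucket, object, and table names are placeholders.
    from datetime import datetime

    from airflow import DAG
    from airflow.providers.google.cloud.transfers.gcs_to_bigquery import (
        GCSToBigQueryOperator,
    )

    with DAG(
        dag_id="gcs_to_bigquery_daily",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",  # run once a day
        catchup=False,
    ) as dag:
        # Load the CSV exports that landed in the bucket into a staging table.
        load_csv = GCSToBigQueryOperator(
            task_id="load_csv_to_bq",
            bucket="my-landing-bucket",
            source_objects=["exports/sales_*.csv"],
            destination_project_dataset_table="my_project.staging.sales_raw",
            source_format="CSV",
            autodetect=True,                   # let BigQuery infer the schema
            write_disposition="WRITE_TRUNCATE",
        )

Cloud Composer only orchestrates the work; the actual load here is a plain BigQuery load job, which keeps the pipeline cheap.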

Upvotes: 1

guillaume blaquiere

Reputation: 76018

After several attempts at data lake / data warehouse design and architecture, I can recommend only one thing: ingest your data into BigQuery as soon as possible, regardless of format or transformation.

Then, in BigQuery, run queries to format, clean, aggregate, and enrich your data. It's not ETL, it's ELT: you load your data first and transform it afterwards.

It's quicker, cheaper, simpler, and based only on SQL.

It works only if BigQuery is your ONLY destination.
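
As a rough sketch of that ELT flow (the bucket, dataset, and table names below are made up for illustration), you load the raw file first and do all the shaping with SQL inside BigQuery:

    # Hypothetical ELT sketch: load raw data first, transform with SQL after.
    from google.cloud import bigquery

    client = bigquery.Client()

    # 1. Load: dump the raw file into BigQuery as-is, schema auto-detected.
    load_job = client.load_table_from_uri(
        "gs://my-landing-bucket/exports/orders.json",
        "my_project.staging.orders_raw",
        job_config=bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
            autodetect=True,
            write_disposition="WRITE_TRUNCATE",
        ),
    )
    load_job.result()  # wait for the load to finish

    # 2. Transform: clean and aggregate with plain SQL inside BigQuery.
    client.query(
        """
        CREATE OR REPLACE TABLE my_project.analytics.orders_daily AS
        SELECT DATE(order_ts) AS order_date,
               SUM(amount)    AS total_amount
        FROM my_project.staging.orders_raw
        WHERE amount IS NOT NULL
        GROUP BY order_date
        """
    ).result()

The load job itself is free of query cost; you only pay for the SQL transformations, which is what makes this pattern cheap when BigQuery is the sole destination.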

Upvotes: 1

dank

Reputation: 178

There are many solutions to achieve this. It depends on several factors, some of which are:

  1. frequency of data ingestion
  2. whether or not the data needs to be manipulated before being written into BigQuery (your files may not be formatted correctly)
  3. whether this will be done manually or automated
  4. size of the data being written

If you are just looking for an ETL tool, you will find many. If you plan to scale this to many pipelines, you might want to look at a more advanced tool like Airflow, but if you just have a few one-off processes you could set up a Cloud Function within GCP to accomplish this. You can schedule it (via cron), invoke it through an HTTP endpoint, or trigger it with Pub/Sub. You can see an example of how this is done here.
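
For instance, a small HTTP-triggered Cloud Function along these lines (bucket and table names are hypothetical placeholders) could pick up a file from Cloud Storage and append it to a BigQuery table each time Cloud Scheduler calls it:

    # Hypothetical Cloud Function sketch: bucket and table names are placeholders.
    from google.cloud import bigquery

    def load_to_bigquery(request):
        """HTTP-triggered entry point (e.g. hit by a Cloud Scheduler cron job)."""
        client = bigquery.Client()

        job_config = bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.CSV,
            skip_leading_rows=1,               # skip the header row
            autodetect=True,                   # infer the schema from the file
            write_disposition="WRITE_APPEND",  # append to the existing table
        )

        # Load the latest export from Cloud Storage into BigQuery.
        job = client.load_table_from_uri(
            "gs://my-landing-bucket/exports/latest.csv",
            "my_project.staging.events_raw",
            job_config=job_config,
        )
        job.result()  # block until the load job completes

        return f"Loaded {job.output_rows} rows", 200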

Upvotes: 1
