BCL

Reputation: 21

Google Datafusion : Loading multiple small tables daily

I want to load about 100 small tables (min 5 records, max 10,000 records) from SQL Server into Google BigQuery on a daily basis. We have created 100 Data Fusion pipelines, one pipeline per source table. A single pipeline takes about 7 minutes to execute: it starts a Dataproc cluster, connects to SQL Server, and sinks the data into BigQuery. Run sequentially, that would take about 700 minutes. When we try to run the pipelines in parallel, we are limited by the network range: each pipeline starts 3 VMs (one master, two workers), so a /24 network caps us at roughly 256/3 concurrent pipelines. In practice, performance degrades when we start more than 10 pipelines in parallel.

Question: is this the right approach?
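The back-of-envelope numbers above can be sketched as (all figures taken from the question; the /24 network size is an assumption):

```python
# Sequential runtime: 100 pipelines at ~7 minutes each.
num_tables = 100
minutes_per_pipeline = 7
sequential_minutes = num_tables * minutes_per_pipeline  # 700 minutes

# Parallelism cap: each pipeline starts 3 VMs (1 master + 2 workers),
# and a /24 network has at most 256 addresses.
addresses = 256
vms_per_pipeline = 3
max_parallel_pipelines = addresses // vms_per_pipeline  # 85

print(sequential_minutes, max_parallel_pipelines)
```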

Upvotes: 2

Views: 973

Answers (2)

Ilya Berdichevsky

Reputation: 1298

Multiple Data Fusion pipelines can use the same pre-provisioned Dataproc cluster. You need to create the Remote Hadoop Provisioner compute profile for the Data Fusion instance.

This feature is only available in Enterprise edition.

See the documentation on how to set up a compute profile for the Data Fusion instance.
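A shared, pre-provisioned Dataproc cluster for the Remote Hadoop Provisioner profile could be created with gcloud, for example. This is only a sketch: the project, region, cluster name, and machine types below are placeholder assumptions you would replace with your own.

```shell
# Create a persistent Dataproc cluster that multiple Data Fusion
# pipelines can share via a Remote Hadoop Provisioner compute profile.
# All names and sizes here are placeholders.
gcloud dataproc clusters create cdf-shared-cluster \
  --project=my-project \
  --region=us-central1 \
  --num-workers=2 \
  --master-machine-type=n1-standard-4 \
  --worker-machine-type=n1-standard-4
```

Because the cluster stays up between runs, pipelines skip the per-run Dataproc provisioning step, which is a large share of the 7-minute runtime for small tables.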

Upvotes: 1

aga

Reputation: 3893

When multiple pipelines run at the same time, multiple Dataproc clusters run behind the scenes, each with its own VMs and disk. There are plugins that can help with multiple source tables: the right one to use here is the CDAP/Google plugin called Multiple Table Plugins, which lets a single pipeline read from multiple source tables.

In the Data Fusion studio, you can find it in Hub -> Plugins.

For the full list of available plugins, please visit the official documentation.

Upvotes: 1
