Blue Clouds
Blue Clouds

Reputation: 8191

Why Azure Data Factory when we have Polybase

Our requirement is to take data from Blob Storage and convert to other tabular form. This can be achieved with Sql DW using polybase. What really is the role of Azure data factory in such cases?

I know the same objective can be met with Azure data factory. But isn't Azure DW with Polybase the easiest option and cost effective?

Upvotes: 3

Views: 5024

Answers (2)

Troy Witthoeft
Troy Witthoeft

Reputation: 2676

On the face if it, sure just use Polybase, it's simple. It uses TSQL. But, how much data are we talking about? Different formats? Will it need cleaning or transformation? As those needs grow, the answer may change. If you have a lot of data to transform, maybe you don't want to spend your DW CPU cycles on transformation instead of querying? So,like most thing here the answer... depends. You could use Polybase, SSIS, ADF, HDInsight, etc... James Serra does a good job laying out all the options. https://www.jamesserra.com/archive/2019/01/what-product-to-use-to-transform-my-data/

Upvotes: 2

wBob
wBob

Reputation: 14399

Polybase can only really do one thing - load data to Azure Synapse Analytics (formerly known as Azure SQL Data Warehouse) or SQL Server with Polybase enabled. More precisely, Polybase acts as a virtualisation layer for flat files stored in storage or data lake allowing them to be presented in the database as external tables or make them available for load into the database as a physical table, eg via CTAS.

It does that very well and is the recommended way to load data into Synapse / Warehouse assuming you select the right DWU and resource class for your data load.

Polybase has a couple of nice extra features around rejecting error rows and a few different file types / separators (comma, pipe etc), but really that’s it.

So if all you need to do is load some files or put a virtualisation layer on, then yes you could just use Polybase.

However let me ask you a few questions:

  • how do you intend to do your orchestration?
  • how do you intend to do your scheduling? Event or time-based?
  • how do you intend to notify someone when the job is complete or errors?
  • how do you plan to start multiple loads simultaneously? Azure Data Factory (ADF) has a great For Each task which can execute up to 20 tasks in parallel for example
  • do you want to run some stored procedures before or after the load? Or any other ELT activities?

Hopefully that helps explain the place of Polybase which is normally at least a piece of the puzzle. You might look at more lightweight options than ADF like Logic Apps, Azure Run Books etc if you have a more lightweight ELT process.

HTH

Upvotes: 6

Related Questions