alxsbn

Reputation: 382

Azure SQL Data Warehouse (Synapse Analytics) PolyBase performance with ORC table

I generate an ORC table (compressed with Snappy) with Spark (Databricks) on an Azure Storage Account (with the ADLS Gen2 feature enabled). The ORC output represents about 12 GB of data (1.2 billion rows). The table has 32 columns.

Once it's generated, I load this file into an internal table in Synapse Analytics using PolyBase.
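Roughly, the PolyBase setup behind that load looks like the sketch below (object names, the storage path and the credential are placeholders, not my real ones):

-- External file format for the Snappy-compressed ORC files
CREATE EXTERNAL FILE FORMAT OrcSnappyFormat
WITH (
    FORMAT_TYPE = ORC,
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);

-- External data source pointing at the ADLS Gen2 container
CREATE EXTERNAL DATA SOURCE AdlsSource
WITH (
    TYPE = HADOOP,
    LOCATION = 'abfss://container@account.dfs.core.windows.net',
    CREDENTIAL = AdlsCredential
);

-- External table over the ORC output folder (32 columns in reality, shortened here)
CREATE EXTERNAL TABLE ext.OrcStaging (
    col1 BIGINT,
    col2 NVARCHAR(100)
)
WITH (
    LOCATION = '/folder1/',
    DATA_SOURCE = AdlsSource,
    FILE_FORMAT = OrcSnappyFormat
);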

Here are my results with different configurations:

When I look at the Storage Account ingress/egress, I see activity for a few minutes (maybe while the ORC files are copied between Synapse nodes), then the Synapse resources start to be stressed: CPU activity for a while, then memory increasing slowly, slowly...

Here is an example of memory (red) and CPU max % (blue):

[chart: memory and CPU max % over time]

Do I need to scale up again? I don't think this is a problem of network throughput. Or maybe a configuration problem? Regarding PolyBase, I don't understand why this is so slow. PolyBase is supposed to ingest TBs of ORC data quickly!

BR, A.

Edit: DWU usage

[chart: DWU usage]

Upvotes: 2

Views: 1117

Answers (1)

wBob

Reputation: 14389

There are a couple of things you can try. Azure Synapse Analytics (formerly known as Azure SQL Data Warehouse) has a concept of readers and writers tied to the DWU. I can't find a current version of this documentation, but some old Gen1 docs I have indicate that DWU1500 has 120 readers. This strongly suggests you should split your one big file up into many files.

I would do some experiments, starting at 10 (i.e. 10 files of 1.2 GB each, for example via Spark's repartition(10) before the write), and work up until you find an optimal setting for your workload. I should say I have not tested this with ORC files, nor is it clear to me whether the ORC format is already inherently partitioned. Try it and see.

You can also try CTAS (if you're not already using it). This will also take advantage of Synapse's ability to parallelise work.
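As an illustration, a CTAS load from a PolyBase external table could look like this (a minimal sketch; the table names and the distribution/index choices are placeholders, not recommendations for your schema):

CREATE TABLE dbo.test_orc
WITH (
    DISTRIBUTION = ROUND_ROBIN,      -- or HASH(some_column) for a large fact table
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT *
FROM ext.OrcStaging;                 -- the external table defined over the ORC files

CTAS runs as a parallel, minimally logged operation across the distributions, which is why it usually beats a plain INSERT ... SELECT for large loads.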

There is also a new feature currently in preview called COPY INTO. According to the documentation it is compatible with ORC files and does not require you to split them:

What is the file splitting guidance for the COPY command loading Parquet or ORC files? There is no need to split Parquet and ORC files because the COPY command will automatically split files. Parquet and ORC files in the Azure storage account should be 256MB or larger for best performance.

https://learn.microsoft.com/en-us/sql/t-sql/statements/copy-into-transact-sql?view=azure-sqldw-latest#what-is-the-file-splitting-guidance-for-the-copy-command-loading-parquet-or-orc-files

COPY INTO test_orc
FROM 'https://yourAccount.blob.core.windows.net/yourBlobcontainer/folder1/*.orc'
WITH (
    FILE_FORMAT = yourFileFormat,
    CREDENTIAL=(IDENTITY= 'Shared Access Signature', SECRET='<Your_SAS_Token>')
)

Work out if you are DWU-bound by viewing the DWU usage in the portal - see if it's maxed out / flatlined, which I guess it isn't.
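If the portal chart isn't conclusive, one way to dig further is the dedicated pool DMVs, for example something like this (the QID filter value below is just a placeholder):

-- Currently running requests and how long they have been going
SELECT request_id, status, command, total_elapsed_time
FROM sys.dm_pdw_exec_requests
WHERE status NOT IN ('Completed', 'Failed', 'Cancelled')
ORDER BY total_elapsed_time DESC;

-- Per-step detail for one load, to see which step is the bottleneck
SELECT step_index, operation_type, status, row_count, total_elapsed_time
FROM sys.dm_pdw_request_steps
WHERE request_id = 'QID1234'         -- replace with the request_id from the first query
ORDER BY step_index;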

Upvotes: 0
