alxsbn

Reputation: 382

Azure SQL Data Warehouse (Synapse Analytics) PolyBase performance with ORC table

I generate an ORC table (compressed with Snappy) with Spark (Databricks) on an Azure Storage Account (with the ADLS Gen2 feature enabled). The ORC output represents about 12 GB of data (1.2 billion rows). The table has 32 columns.

Once it's generated, I load this file into an internal table in Synapse Analytics using PolyBase.
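Roughly, the PolyBase setup behind that load looks like the sketch below (object names, the storage path and the credential are placeholders, not my real ones):

-- External file format for the Snappy-compressed ORC files
CREATE EXTERNAL FILE FORMAT OrcSnappyFormat
WITH (
    FORMAT_TYPE = ORC,
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);

-- External data source pointing at the ADLS Gen2 container
CREATE EXTERNAL DATA SOURCE AdlsSource
WITH (
    TYPE = HADOOP,
    LOCATION = 'abfss://container@account.dfs.core.windows.net',
    CREDENTIAL = AdlsCredential
);

-- External table over the ORC output folder (32 columns in reality, shortened here)
CREATE EXTERNAL TABLE ext.OrcStaging (
    col1 BIGINT,
    col2 NVARCHAR(100)
)
WITH (
    LOCATION = '/folder1/',
    DATA_SOURCE = AdlsSource,
    FILE_FORMAT = OrcSnappyFormat
);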

Here are my results with different configurations:

When I look at the Storage Account ingress/egress, I see activity for a few minutes (maybe while the ORC files are copied between Synapse nodes), then the Synapse resources start to be stressed: CPU activity for a while, then memory increasing slowly, slowly...

Here is an example of memory (red) and CPU max % (blue):

[chart: memory and CPU max % over time]

Do I need to scale up again? I don't think this is a problem of network throughput. Or maybe a configuration problem? Regarding PolyBase, I don't understand why this is so slow. PolyBase is supposed to ingest TBs of ORC data quickly!

BR, A.

Edit: DWU usage

[chart: DWU usage]

Upvotes: 2

Views: 1117

Answers (1)

wBob

Reputation: 14389

There are a couple of things you can try. Azure Synapse Analytics (formerly known as Azure SQL Data Warehouse) has a concept of readers and writers tied to the DWU. I can't find a current version of this documentation, but some old Gen1 docs I have indicate that DWU1500 has 120 readers. This strongly suggests you should split your one big file up into many files.

I would do some experiments, starting at 10 (i.e. 10 files of 1.2 GB each, for example via Spark's repartition(10) before the write), and work up until you find an optimal setting for your workload. I should say I have not tested this with ORC files, nor is it clear to me whether the ORC format is already inherently partitioned. Try it and see.

You can also try CTAS (if you're not already using it). This will also take advantage of Synapse's ability to parallelise work.
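As an illustration, a CTAS load from a PolyBase external table could look like this (a minimal sketch; the table names and the distribution/index choices are placeholders, not recommendations for your schema):

CREATE TABLE dbo.test_orc
WITH (
    DISTRIBUTION = ROUND_ROBIN,      -- or HASH(some_column) for a large fact table
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT *
FROM ext.OrcStaging;                 -- the external table defined over the ORC files

CTAS runs as a parallel, minimally logged operation across the distributions, which is why it usually beats a plain INSERT ... SELECT for large loads.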

There is also a new feature currently in preview called COPY INTO. According to the documentation it is compatible with ORC files and does not require you to split them:

What is the file splitting guidance for the COPY command loading Parquet or ORC files? There is no need to split Parquet and ORC files because the COPY command will automatically split files. Parquet and ORC files in the Azure storage account should be 256MB or larger for best performance.

https://learn.microsoft.com/en-us/sql/t-sql/statements/copy-into-transact-sql?view=azure-sqldw-latest#what-is-the-file-splitting-guidance-for-the-copy-command-loading-parquet-or-orc-files

COPY INTO test_orc
FROM 'https://yourAccount.blob.core.windows.net/yourBlobcontainer/folder1/*.orc'
WITH (
    FILE_FORMAT = yourFileFormat,
    CREDENTIAL=(IDENTITY= 'Shared Access Signature', SECRET='<Your_SAS_Token>')
)

Work out if you are DWU-bound by viewing the DWU usage in the portal - see if it's maxed out / flatlined, which I guess it isn't.
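If the portal chart isn't conclusive, one way to dig further is the dedicated pool DMVs, for example something like this (the QID filter value below is just a placeholder):

-- Currently running requests and how long they have been going
SELECT request_id, status, command, total_elapsed_time
FROM sys.dm_pdw_exec_requests
WHERE status NOT IN ('Completed', 'Failed', 'Cancelled')
ORDER BY total_elapsed_time DESC;

-- Per-step detail for one load, to see which step is the bottleneck
SELECT step_index, operation_type, status, row_count, total_elapsed_time
FROM sys.dm_pdw_request_steps
WHERE request_id = 'QID1234'         -- replace with the request_id from the first query
ORDER BY step_index;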

Upvotes: 0
