Reputation: 1319
What would cause Polybase performance to degrade when querying larger datasets in order to insert records into Azure Data Warehouse from Blob storage?
For example, a few thousand compressed (.gz) CSV files with headers partitioned by a few hours per day across 6 months worth of data. Querying these files from an external table in SSMS is not exactly optimial and it's extremely slow.
Objectively, I'm loading data into Polybase in order to transfer data into Azure Data Warehouse. Except, it seems with large datasets, Polybase is pretty slow.
What options are available to optimize Polybase here? Wait out the query or load the data after each upload to blob storage incrementally?
Upvotes: 2
Views: 2555
Reputation: 13755
In your scenario, Polybase has to connect to the files in the external source, uncompress them, then ensure they fit your external table definition (schema) and then allow the contents to be targeted by the query. When you are processing large amounts of text files in a one-off import fashion, there is nothing to really cache either, since it is dealing with new content every time. In short, your scenario is compute heavy.
Azure Blob Storage will (currently) max out at around 1,250MB/sec, so if your throughput is not near maxing this, then the best way to improve performance is to upgrade your DWU on your SQL data warehouse. In the background, this will spread your workload over a bigger cluster (more servers). SQL Data Warehouse DWU can be scaled either up and down in a matter of minutes.
If you have huge volumes and are maxing the storage, then use multiple storage accounts to spread the load.
Other alternatives include relieving Polybase of the unzip work as part of your upload or staging process. Do this from within Azure where the network bandwidth within a data center is lightning fast.
You could also consider using Azure Data Factory to do the work. See here for supported file formats. GZip is supported. Use the Copy Activity to copy from the Blob storage in to SQL DW.
Also look in to:
Upvotes: 1