SandeshR

Reputation: 213

External table in Azure Synapse: very slow performance

I have a parquet file and created a new external table, but the performance is very slow compared to a normal table in Synapse. Can you please let me know how to overcome this?

Upvotes: 2

Views: 5406

Answers (2)

Vinny

Reputation: 55

a) The first step is to partition your Parquet file using a relevant partition column, such as Year, Month, and Date.

b) Second, I recommend using a view rather than an external table. External tables don't support partition pruning and won't use the partition columns to eliminate unnecessary files during the read.

c) Ensure that data types are enforced, and that string types are sized appropriately.

d) If possible, convert your Parquet files to Delta format. Synapse can read partition columns from Delta without needing the filepath() and filename() functions. External tables do not support Delta; only views do.

Note: external tables don't support Parquet partition columns.

SELECT *,
  CAST(fct.filepath(1) AS SMALLINT) AS SalesOrderPathYear,
  CAST(fct.filepath(2) AS TINYINT) AS SalesOrderPathMonth,
  CAST(fct.filepath(3) AS DATE) AS SalesOrderPathDate
FROM
  OPENROWSET
  (
     BULK 'conformed/facts/factsales/*/*/*/*.parquet',
     DATA_SOURCE = 'ExternalDataSourceDataLake',
     FORMAT = 'PARQUET'
  )
  WITH
  (
    ColA VARCHAR(10),
    ColB INT,
    ColC ...
  ) AS fct
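To make partition pruning reusable, the OPENROWSET query above can be wrapped in a view that exposes the path-derived columns, so downstream queries can filter with a plain WHERE clause. This is a sketch; the view name and schema are assumptions, and the path pattern is carried over from the query above:

-- Hypothetical view wrapping the OPENROWSET query above
CREATE VIEW dbo.vw_FactSales
AS
SELECT *,
  CAST(fct.filepath(1) AS SMALLINT) AS SalesOrderPathYear,
  CAST(fct.filepath(2) AS TINYINT)  AS SalesOrderPathMonth
FROM OPENROWSET
(
   BULK 'conformed/facts/factsales/*/*/*/*.parquet',
   DATA_SOURCE = 'ExternalDataSourceDataLake',
   FORMAT = 'PARQUET'
) AS fct;

-- Serverless SQL can use the filepath() predicates to skip
-- non-matching folders, reading only the 2022/06 files
SELECT COUNT(*)
FROM dbo.vw_FactSales
WHERE SalesOrderPathYear = 2022
  AND SalesOrderPathMonth = 6;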

Ref: https://www.serverlesssql.com/certification/mastering-dp-500-exam-querying-partitioned-sources-in-azure-storage/

Upvotes: 0

Kashyap

Reputation: 17401

Very broad question, so I'll give a broad answer:

  1. Use a normal table. It's hard to beat the performance of a "normal table" with external tables. A "normal table" means a table created in a Dedicated SQL pool using CREATE TABLE. If you're querying one or more tables repeatedly and each query is different (group-by, join, selected columns), external tables won't beat a "normal" table.
  2. Understand and apply basic best practices:
    • Use parquet format, which you're doing.

    • Pick the right partition column and partition your data by storing partitions in different folders or file names.

    • If a query targets a single large file, you'll benefit from splitting it into multiple smaller files.

    • If using CSV, try to keep file sizes between 100 MB and 10 GB.

    • Use correct data types.

    • Manually create statistics for CSV files.

    • Use CETAS to enhance query performance and joins.

      ...and many more.
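The last two bullets can be sketched as follows. This is a hedged illustration, not the poster's exact setup: the table, column, and location names are assumptions, and the named external file format (ParquetFileFormat) is assumed to have been created beforehand with CREATE EXTERNAL FILE FORMAT:

-- Hypothetical: manually create statistics on a column that appears
-- in filters/joins, so the optimizer can estimate cardinality
-- (serverless SQL pools require NORECOMPUTE)
CREATE STATISTICS sStoreKey
ON dbo.SalesCsvExternal (StoreKey)
WITH FULLSCAN, NORECOMPUTE;

-- Hypothetical: CETAS materializes a query result as parquet files
-- in the lake; later queries read the pre-computed result instead
-- of re-scanning the raw CSV
CREATE EXTERNAL TABLE dbo.SalesAggregated
WITH (
    LOCATION = 'curated/sales_aggregated/',
    DATA_SOURCE = ExternalDataSourceDataLake,
    FILE_FORMAT = ParquetFileFormat
)
AS
SELECT StoreKey, SUM(Amount) AS TotalAmount
FROM dbo.SalesCsvExternal
GROUP BY StoreKey;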

Upvotes: 2
