Reputation: 123
I am trying to load an external table in Azure Synapse using a PySpark notebook, but the datatypes seem to mismatch. When I initialize the table I execute (stripped-down example):
CREATE OR REPLACE TABLE LANDING_DB.Opportunity (
    AccountId varchar(255),
    Id varchar(255),
    Name STRING,
    Description STRING
)
USING DELTA LOCATION 'abfss://[email protected]/Landing/ABC/Opportunity/'
via
spark.sql()
Yet the external table uses 'varchar(8000)' as the datatype for the Name and Description columns. I would like to use varchar(max), as the Description column in particular can contain a lot of text.
Am I missing some settings or is this a bug?
The database is the serverless SQL pool on the Synapse instance.
Help would be appreciated, thank you in advance.
Best, Sven
Upvotes: 1
Views: 1445
Reputation: 164
First of all, note the tip: keep VARCHAR() as short as possible and avoid VARCHAR(MAX) where you can, because it can impair performance.
It seems that STRING is automatically mapped to varchar(8000) because the serverless pool infers the length. See https://github.com/MicrosoftDocs/azure-docs/blob/main/articles/synapse-analytics/sql/best-practices-serverless-sql-pool.md#check-inferred-data-types
Instead of using STRING, you can declare varchar(x) with a higher number. Do you perhaps know what the maximum length of Description could be? varchar() takes an integer argument, and the maximum value of an integer is 2,147,483,647, but I don't think your description can get that long. So it is better to pick a lower number that will still never be reached.
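Applied to the stripped-down DDL from the question, that could look like the sketch below. The lengths 500 and 20000 are only illustrative assumptions; choose values that fit your actual data:

```sql
-- Same DDL as in the question, but with explicit varchar lengths
-- instead of STRING so the serverless pool does not fall back to
-- the inferred varchar(8000).
-- NOTE: 500 and 20000 are assumed limits, not recommendations.
CREATE OR REPLACE TABLE LANDING_DB.Opportunity (
    AccountId varchar(255),
    Id varchar(255),
    Name varchar(500),
    Description varchar(20000)
)
USING DELTA LOCATION 'abfss://[email protected]/Landing/ABC/Opportunity/'
```

You can run this through spark.sql() exactly as before; only the column type declarations change.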
Upvotes: 1