Gorkem
Gorkem

Reputation: 971

Azure Synapse Serverless External table from Parquet nvarchar limit problem

I am using Azure Synapse Serverless SQL to create a daily dump from datalake which includes historic data. In order to create a current up-to-date version I am doing a CETAS with join on update date and join column (In below example MS_ID and UPDDATE)

Query to create external table is as below (main data is partitioned on day bases with multiple parquet files on same day if there is update)

Since I try to get column information actual columns on parquet have column types. On this specific problem since Azure Synapse tries to converts String type to nvarchar(max) in final join (or nvarchar(8000) ) I am having resource contraints.

"Execution fail against sql server. Sql error number: 65000. Error Message: This query cannot be executed due to current resource constraints."

Question is how can I limit so that "any" Text column from Parquet can be taken as for example nvarchar(1000) without specificying table columns and types explicityly?

CREATE EXTERNAL TABLE [staging].[new_external_table]
 WITH (
 LOCATION = 'staging_zone_table_directory',DATA_SOURCE = staging_zone_managed, 
 FILE_FORMAT = PARQUET_SNAPPY
)
 AS 
WITH 
table_from_adsl as (
 SELECT * , b.filepath(1) AS [partition_year], b.filepath(2) AS [partition_month], b.filepath(3) AS [day] FROM OPENROWSET(
 BULK 'table_datalake_root_directory/year=*/month=*/day=*/*.parquet' ,
 FORMAT = 'PARQUET',
 DATA_SOURCE = 'datalake_container_connection',
 MAXERRORS = 10
 ) AS b 
),max_update_table as (
 SELECT [MS_ID],MAX([UPDDATE]) max_upddate
 FROM table_from_adsl
 GROUP BY [MS_ID]
 )

SELECT DISTINCT t1.* 
 FROM table_from_adsl t1
INNER JOIN max_update_table t2
on t1.[MS_ID]=t2.[MS_ID] AND t1.[UPDDATE] = t2.max_upddate
where partition_year>=2017 

Output of sp_describe_first_result_set for above query is as below :

table description

Upvotes: 1

Views: 2685

Answers (1)

maya-msft
maya-msft

Reputation: 432

The only way to override default type mapping done by system is to provide column/type declarations within the OPENROWSET statement.

Upvotes: 1

Related Questions