Reputation: 971
I am using Azure Synapse Serverless SQL to create a daily dump from datalake which includes historic data. In order to create a current up-to-date version I am doing a CETAS with join on update date and join column (In below example MS_ID and UPDDATE)
Query to create external table is as below (main data is partitioned on day bases with multiple parquet files on same day if there is update)
Since I try to get column information actual columns on parquet have column types. On this specific problem since Azure Synapse tries to converts String type to nvarchar(max) in final join (or nvarchar(8000) ) I am having resource contraints.
"Execution fail against sql server. Sql error number: 65000. Error Message: This query cannot be executed due to current resource constraints."
Question is how can I limit so that "any" Text column from Parquet can be taken as for example nvarchar(1000) without specificying table columns and types explicityly?
CREATE EXTERNAL TABLE [staging].[new_external_table]
WITH (
LOCATION = 'staging_zone_table_directory',DATA_SOURCE = staging_zone_managed,
FILE_FORMAT = PARQUET_SNAPPY
)
AS
WITH
table_from_adsl as (
SELECT * , b.filepath(1) AS [partition_year], b.filepath(2) AS [partition_month], b.filepath(3) AS [day] FROM OPENROWSET(
BULK 'table_datalake_root_directory/year=*/month=*/day=*/*.parquet' ,
FORMAT = 'PARQUET',
DATA_SOURCE = 'datalake_container_connection',
MAXERRORS = 10
) AS b
),max_update_table as (
SELECT [MS_ID],MAX([UPDDATE]) max_upddate
FROM table_from_adsl
GROUP BY [MS_ID]
)
SELECT DISTINCT t1.*
FROM table_from_adsl t1
INNER JOIN max_update_table t2
on t1.[MS_ID]=t2.[MS_ID] AND t1.[UPDDATE] = t2.max_upddate
where partition_year>=2017
Output of sp_describe_first_result_set
for above query is as below :
Upvotes: 1
Views: 2685
Reputation: 432
The only way to override default type mapping done by system is to provide column/type declarations within the OPENROWSET statement.
Upvotes: 1