Koen van Wielink

Reputation: 141

Failure loading parquet in Synapse Analytics - INT mapped as UTF8

We have an on-premises Oracle database from which we need to extract data and store it in a Synapse dedicated pool. I have created a Synapse pipeline which first copies the data from Oracle to a data lake as a parquet file, which is then imported into Synapse by a second copy task. The data is extracted from Oracle through a dynamically created query. This query has 2 hard-coded INT values which are generated at runtime. The query runs fine and the parquet file is created correctly, but if I use polybase or the COPY command to import the file into Synapse, it fails with the following error:

"errorCode": "2200", "message": "ErrorCode=UserErrorSqlDWCopyCommandError,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=SQL DW Copy Command operation failed with error 'HdfsBridge::recordReaderFillBuffer - Unexpected error encountered filling record reader buffer: ClassCastException: ',Source=Microsoft.DataTransfer.ClientLibrary,''Type=System.Data.SqlClient.SqlException,Message=HdfsBridge::recordReaderFillBuffer - Unexpected error encountered filling record reader buffer: ClassCastException: ,Source=.Net SqlClient Data Provider,SqlErrorNumber=106000,Class=16,ErrorCode=-2146232060,State=1,Errors=[{Class=16,Number=106000,State=1,Message=HdfsBridge::recordReaderFillBuffer - Unexpected error encountered filling record reader buffer: ClassCastException: ,},],'",

Bulk insert works, but it is less efficient on large quantities of data, so I don't want to use it. The mapping for the copy activities is created dynamically based on the target database table definition. However, when I created a separate copy task and imported the mapping to check what is going on, I noticed that the 2 INT columns are mapped as UTF8 on the parquet source side, while the sink table expects INT32. When I exclude both columns the copy task completes successfully. It seems the copy activity fails because it cannot implicitly cast a string to an integer.

(Screenshot: copy activity mapping showing the 2 columns as UTF8 on the parquet source side and INT32 on the sink side.)
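For reference, the copy activity's COPY option issues something along the lines of the statement below (the storage URL and table name here are placeholders, not the real ones):

-- Rough stand-alone equivalent of what the copy activity runs
-- (account, container, path, and table name are placeholders).
COPY INTO dbo.DRKASTR
FROM 'https://mydatalake.blob.core.windows.net/staging/DRKASTR/*.parquet'
WITH (
    FILE_TYPE  = 'PARQUET',
    CREDENTIAL = (IDENTITY = 'Managed Identity')
);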

The 2 columns are explicitly cast as integers in the Oracle query that is the source for the parquet file.

SELECT  t.* 
    ,   CAST(419 AS INT) AS "Execution_id"
    ,   CAST(4832 AS INT) AS "Task_id"
    ,   TO_DATE('2022-07-05 14:40:34', 'YYYY-MM-DD HH24:MI:SS') AS "ProcessedDTS"
    ,   t.DEMUTDT AS "EffectiveDTS" 
FROM    CBO.DRKASTR t
WHERE   DEMUTDT >= TO_DATE('2022-07-05 13:37:35', 'YYYY-MM-DD HH24:MI:SS');
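One thing that may be relevant: INT in an Oracle query is an ANSI alias that resolves to NUMBER(38,0), so the connector may be seeing an unbounded number rather than a 32-bit integer. A variant of the cast with an explicit bounded precision would look like this (sketch, untested):

SELECT  t.*
    ,   CAST(419 AS NUMBER(10,0)) AS "Execution_id"
    ,   CAST(4832 AS NUMBER(10,0)) AS "Task_id"
    ,   TO_DATE('2022-07-05 14:40:34', 'YYYY-MM-DD HH24:MI:SS') AS "ProcessedDTS"
    ,   t.DEMUTDT AS "EffectiveDTS"
FROM    CBO.DRKASTR t
WHERE   DEMUTDT >= TO_DATE('2022-07-05 13:37:35', 'YYYY-MM-DD HH24:MI:SS');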

Adding an explicit mapping to the Oracle-to-parquet copy activity that maps them as INT doesn't solve the problem either. How do I prevent these 2 columns from being interpreted as strings instead of integers?

Upvotes: 1

Views: 804

Answers (1)

Koen van Wielink

Reputation: 141

We ended up resolving this by first importing the data as strings into the database and casting to the correct data types during further processing.
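Roughly, the pattern looks like this (sketch only; the staging table, schema names, and column sizes are illustrative, and only the generated columns are shown):

-- Staging table that receives the parquet columns as strings
-- (names and sizes are illustrative).
CREATE TABLE stg.DRKASTR
(
    Execution_id    VARCHAR(20)     NOT NULL,
    Task_id         VARCHAR(20)     NOT NULL,
    ProcessedDTS    DATETIME2       NOT NULL,
    EffectiveDTS    DATETIME2       NOT NULL
)
WITH (HEAP, DISTRIBUTION = ROUND_ROBIN);

-- Cast to the proper types while moving the data into the target table.
INSERT INTO dbo.DRKASTR (Execution_id, Task_id, ProcessedDTS, EffectiveDTS)
SELECT  CAST(Execution_id AS INT)
    ,   CAST(Task_id AS INT)
    ,   ProcessedDTS
    ,   EffectiveDTS
FROM    stg.DRKASTR;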

Upvotes: 0
