Reputation: 2882
We are running SQL 2019 with CU 12 with an external data source that points to ADLS Gen2 storage account. We have two parquet files in the same directory where one file has 2 columns and the other file has 3 columns. We purposely did this to test the reject options knowing that our schemas will change over time.
/employee/file1.csv (2 columns/5 rows)
/employee/file2.csv (3 columns/5 rows)
Based on the documentation for reject options, we should be able to query across the external table and return non-dirty rows in the result set if reject rows fall within the reject configuration which is listed below.
CREATE EXTERNAL TABLE [dbo].[Employee] (
[FirstName] varchar(100) NOT NULL,
[LastName] varchar(100) NOT NULL
)
WITH (LOCATION='/employee/',
DATA_SOURCE = DATA_LAKE,
FILE_FORMAT = ParquetFileFormat,
REJECT_TYPE = VALUE,
REJECT_VALUE = 1000000
);
When we select from the external table, I would expect to have it return the 5 rows from the one file that has 2 columns and reject the 5 rows from the file that contains 3 columns. Instead, we get no rows at all with the following exception.
Unexpected error encountered creating the record reader. HadoopExecutionException: Column count mismatch. Source file has 3 columns, external table definition has 2 columns.
I feel like I must be missing something or my understanding of how reject options support file schema differences is incorrect. Can anyone shed any light on this?
Upvotes: 0
Views: 594
Reputation: 131
The way Polybase works is that it first queries the schema on both files to see if they match the specification; since they don't, Polybase doesn't query one file successfully and fails on the other one. Both files must adhere to the external table specification first, and then you can have some records with two columns and other records with three columns. You can learn more about how Polybase works in my book "Hands-on data virtualization with Polybase".
Upvotes: 0