Reputation: 155
I tried to copy a Parquet file to a table in Azure Synapse by using PolyBase T-SQL. Here is an example:
data = [["Jean", 15, "Tennis"], ["Jane", 20, "Yoga"], ["Linda", 35, "Yoga"], ["Linda", 35, "Tennis"]]
columns = ["Name", "Age", "Sport"]
df = spark.createDataFrame(data, columns)
Then I save the DataFrame as a Parquet file, partitioned by the column "Sport":
(df
    .write
    .option("header", True)   # note: "header" is a CSV option; the Parquet writer ignores it
    .partitionBy("Sport")     # writes subfolders Sport=Tennis/ and Sport=Yoga/
    .format("parquet")
    .mode("overwrite")
    .save("/mnt/test_partition_polybase.parquet"))
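For reference, when Spark reads this folder back, it rebuilds "Sport" from the Sport=... directory names, so all three columns appear on the Spark side:

# Spark's partition discovery adds "Sport" back from the folder names
df_back = spark.read.parquet("/mnt/test_partition_polybase.parquet")
df_back.printSchema()
# root
#  |-- Name: string (nullable = true)
#  |-- Age: long (nullable = true)
#  |-- Sport: string (nullable = true)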
Then I use PolyBase T-SQL to copy from the Parquet file to a table in Synapse:
IF OBJECT_ID(N'EVENTSTORE.TEST') IS NOT NULL
BEGIN
    DROP EXTERNAL TABLE EVENTSTORE.TEST
END

CREATE EXTERNAL TABLE EVENTSTORE.TEST (
    [Name] NVARCHAR(250),
    [Age] BIGINT,
    [Sport] NVARCHAR(250)
)
WITH (
    DATA_SOURCE = [my_data_source],
    LOCATION = N'test_partition_polybase.parquet',
    FILE_FORMAT = [SynapseParquetFormat],
    REJECT_TYPE = VALUE,
    REJECT_VALUE = 0
)
I get the error:
External file access failed due to internal error: 'File test_partition_polybase.parquet/Sport=Tennis/part-00000-tid-5109481329839379912-631db0ad-cd52-4f9e-acf6-76828a8aa4eb-67-1.c000.snappy.parquet: HdfsBridge::CreateRecordReader - Unexpected error encountered creating the record reader: HadoopExecutionException: Column count mismatch. Source file has 2 columns, external table definition has 3 columns
This is because the partition column "Sport" is stored only in the directory names (Sport=Tennis/, Sport=Yoga/), not inside the Parquet files themselves, so each data file contains just the 2 columns Name and Age while the external table expects 3. How can I solve this issue?
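A quick way to see this (assuming the same mount path as above) is to read one partition folder directly; Spark then skips partition discovery and shows exactly what PolyBase sees in the files:

# Reading a leaf partition folder directly: the partition column is NOT
# inferred, so the schema matches the physical file contents
one_partition = spark.read.parquet("/mnt/test_partition_polybase.parquet/Sport=Tennis")
one_partition.printSchema()
# root
#  |-- Name: string (nullable = true)
#  |-- Age: long (nullable = true)
# Only 2 columns, while the external table definition has 3.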
Upvotes: 1
Views: 788
Reputation: 155
I finally solved my problem by duplicating the partition column, so the original "Sport" column stays inside the data files:
from pyspark.sql import functions as F

(df
    .withColumn("part_Sport", F.col("Sport"))  # copy of "Sport" used only for partitioning
    .write
    .option("header", True)   # still a no-op for Parquet
    .partitionBy("part_Sport")
    .format("parquet")
    .mode("overwrite")
    .save("/mnt/test_partition_polybase.parquet"))
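With this layout every data file physically contains Name, Age and Sport, so the original external table definition with 3 columns matches. A quick sanity check from Spark (same assumption about the mount path):

# Each file under part_Sport=.../ now carries all three data columns
check = spark.read.parquet("/mnt/test_partition_polybase.parquet/part_Sport=Tennis")
check.printSchema()
# root
#  |-- Name: string (nullable = true)
#  |-- Age: long (nullable = true)
#  |-- Sport: string (nullable = true)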
I don't know if it is the best solution, but it's the only one I have found.
Upvotes: 0