Reputation: 155
I tried to copy a Parquet file to a table in Azure Synapse by using PolyBase T-SQL. Here is an example:
data = [["Jean", 15, "Tennis"], ["Jane", 20, "Yoga"], ["Linda", 35, "Yoga"], ["Linda", 35, "Tennis"]]
columns = ["Name", "Age", "Sport"]
df = spark.createDataFrame(data, columns)
Then I save the DataFrame as a Parquet file, partitioned by the column "Sport":
(df
    .write
    .option("header", True)   # note: "header" is a CSV option; the Parquet writer ignores it
    .partitionBy("Sport")     # writes subfolders Sport=Tennis/ and Sport=Yoga/
    .format("parquet")
    .mode("overwrite")
    .save("/mnt/test_partition_polybase.parquet"))
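For reference, when Spark reads this folder back, it rebuilds "Sport" from the Sport=... directory names, so all three columns appear on the Spark side:

# Spark's partition discovery adds "Sport" back from the folder names
df_back = spark.read.parquet("/mnt/test_partition_polybase.parquet")
df_back.printSchema()
# root
#  |-- Name: string (nullable = true)
#  |-- Age: long (nullable = true)
#  |-- Sport: string (nullable = true)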
Then I use PolyBase T-SQL to copy from the Parquet file to a table in Synapse:
IF OBJECT_ID(N'EVENTSTORE.TEST') IS NOT NULL
BEGIN
    DROP EXTERNAL TABLE EVENTSTORE.TEST
END

CREATE EXTERNAL TABLE EVENTSTORE.TEST (
    [Name] NVARCHAR(250),
    [Age] BIGINT,
    [Sport] NVARCHAR(250)
)
WITH (
    DATA_SOURCE = [my_data_source],
    LOCATION = N'test_partition_polybase.parquet',
    FILE_FORMAT = [SynapseParquetFormat],
    REJECT_TYPE = VALUE,
    REJECT_VALUE = 0
)
I get the error:
External file access failed due to internal error: 'File test_partition_polybase.parquet/Sport=Tennis/part-00000-tid-5109481329839379912-631db0ad-cd52-4f9e-acf6-76828a8aa4eb-67-1.c000.snappy.parquet: HdfsBridge::CreateRecordReader - Unexpected error encountered creating the record reader: HadoopExecutionException: Column count mismatch. Source file has 2 columns, external table definition has 3 columns
This is because the partition column "Sport" is stored only in the directory names (Sport=Tennis/, Sport=Yoga/), not inside the Parquet files themselves, so each data file contains just the 2 columns Name and Age while the external table expects 3. How can I solve this issue?
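A quick way to see this (assuming the same mount path as above) is to read one partition folder directly; Spark then skips partition discovery and shows exactly what PolyBase sees in the files:

# Reading a leaf partition folder directly: the partition column is NOT
# inferred, so the schema matches the physical file contents
one_partition = spark.read.parquet("/mnt/test_partition_polybase.parquet/Sport=Tennis")
one_partition.printSchema()
# root
#  |-- Name: string (nullable = true)
#  |-- Age: long (nullable = true)
# Only 2 columns, while the external table definition has 3.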
Upvotes: 1
Views: 788
Reputation: 155
I finally solved my problem by duplicating the partition column, so the original "Sport" column stays inside the data files:
from pyspark.sql import functions as F

(df
    .withColumn("part_Sport", F.col("Sport"))  # copy of "Sport" used only for partitioning
    .write
    .option("header", True)   # still a no-op for Parquet
    .partitionBy("part_Sport")
    .format("parquet")
    .mode("overwrite")
    .save("/mnt/test_partition_polybase.parquet"))
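With this layout every data file physically contains Name, Age and Sport, so the original external table definition with 3 columns matches. A quick sanity check from Spark (same assumption about the mount path):

# Each file under part_Sport=.../ now carries all three data columns
check = spark.read.parquet("/mnt/test_partition_polybase.parquet/part_Sport=Tennis")
check.printSchema()
# root
#  |-- Name: string (nullable = true)
#  |-- Age: long (nullable = true)
#  |-- Sport: string (nullable = true)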
I don't know if it is the best solution, but it's the only one I have found.
Upvotes: 0