TheMP

Reputation: 8427

Saving to parquet subpartition

I have a directory structure based on two partitions, like this:

  People
  > surname=Doe
        > name=John
        > name=Joe
  > surname=White
        > name=Josh
        > name=Julien

I am reading parquet files that contain information only about the Does, so I am specifying surname=Doe directly as the output directory for my DataFrame. The problem is that I am now trying to add name-based partitioning with partitionBy("name") on write:

df.write.partitionBy("name").parquet(outputDir)

(outputDir contains the path to the Doe directory)

This causes an error like the one below:

  Caused by: java.lang.AssertionError: assertion failed: Conflicting partition column names detected:
    Partition column name list #0: surname, name
    Partition column name list #1: surname

Any tips on how to solve this? It probably occurs because of the _SUCCESS file created in the surname directory, which gives Spark the wrong hints about the partitioning; when I remove the _SUCCESS and _metadata files, Spark is able to read everything without any issue.
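
For illustration, a minimal read of the whole tree after removing those files looks roughly like this (sqlContext is an assumed SQLContext, and the path is illustrative):

// With the marker files gone, partition discovery over /People infers both
// partition columns (surname and name) from the directory layout.
val people = sqlContext.read.parquet("/People")
people.printSchema()   // includes surname and name as partition columns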

Upvotes: 11

Views: 3272

Answers (2)

TheMP

Reputation: 8427

I managed to solve it with a workaround (I don't think this is a good idea, but it works): I disabled the creation of the additional _SUCCESS and _metadata files with:

sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

That way Spark won't get any silly ideas about the partitioning structure.
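
Put together, a minimal sketch of this workaround (assuming sc is the SparkContext, df is the DataFrame holding only the Does, and outputDir still points at the surname=Doe directory):

// Suppress the _SUCCESS marker and the Parquet summary metadata files, so the
// only things landing under surname=Doe are the name=... partition directories.
sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

// Write directly into the existing surname=Doe directory, partitioned by name.
df.write.partitionBy("name").parquet(outputDir)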

Another option is to save to the "proper" directory, People, and partition by both surname and name. Keep in mind that the only sane option then is to set SaveMode to Append and manually delete the directories you expect to be overwritten, as sketched after the warning below (this is really error-prone):

df.write.mode(SaveMode.Append).partitionBy("surname","name").parquet("/People")

Do not use the Overwrite SaveMode in this case; it will delete ALL of the surname directories.
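
A rough sketch of that delete-then-append flow using the Hadoop FileSystem API (the fs value, the exact path, and the df name are illustrative assumptions, not part of the original code):

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SaveMode

// Remove only the partition about to be rewritten, leaving the other surnames intact.
val fs = FileSystem.get(sc.hadoopConfiguration)
val doeDir = new Path("/People/surname=Doe")      // illustrative: the partition being replaced
if (fs.exists(doeDir)) fs.delete(doeDir, true)    // recursive delete

// Append into the parent directory; Spark recreates surname=Doe/name=... underneath.
df.write.mode(SaveMode.Append).partitionBy("surname", "name").parquet("/People")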

Upvotes: 8

Ewan Leith

Reputation: 1665

sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

is fairly sensible anyway; if you have summary metadata enabled, writing the metadata file can become an IO bottleneck on reads and writes.

An alternative to your solution might be to add .mode("append") to your write, but with the original parent directory as the destination:

df.write.mode("append").partitionBy("name").parquet("/People")

Upvotes: 2
