nish

Reputation: 7280

spark: case sensitive partitionBy column

I am trying to write out a DataFrame with HiveContext (in ORC format) using a partition key:

df.write().partitionBy("event_type").mode(SaveMode.Overwrite).orc("/path");

However, the column I am partitioning on has values that differ only by case, and this throws an error while writing:

Caused by: java.io.IOException: File already exists: file:/path/_temporary/0/_temporary/attempt_201607262359_0001_m_000000_0/event_type=searchFired/part-r-00000-57167cfc-a9db-41c6-91d8-708c4f7c572c.orc

The event_type column contains both searchFired and SearchFired as values. If I remove one of them from the DataFrame, the write succeeds. How do I solve this?

Upvotes: 3

Views: 1522

Answers (1)

Sim

Reputation: 13538

It is generally not a good idea to rely on case differences in file names: some filesystems (such as the defaults on macOS and Windows) are case-insensitive, so the directories event_type=searchFired and event_type=SearchFired resolve to the same path.

The solution is to combine values that differ only by case into the same partition, for example (using the Scala DSL):

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.expr

df
  .withColumn("par_event_type", expr("lower(event_type)"))
  .write
  .partitionBy("par_event_type")
  .mode(SaveMode.Overwrite)
  .orc("/path")

This adds an extra column for partitioning. If that causes problems, you can use drop to remove it when you read the data.
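The collision and the fix can be sketched without Spark at all: the writer names each partition directory column=value, so values that differ only by case produce directory names a case-insensitive filesystem treats as the same path, while lowercasing first merges them into one partition. A minimal Java sketch (the partitionDir helper is hypothetical, not a Spark API):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class PartitionCaseDemo {
    // Spark names each partition directory "column=value".
    static String partitionDir(String col, String value) {
        return col + "=" + value;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("searchFired", "SearchFired");

        // Raw values: two directory names that differ only by case,
        // which a case-insensitive filesystem maps to the same path.
        Set<String> raw = values.stream()
                .map(v -> partitionDir("event_type", v))
                .collect(Collectors.toSet());
        System.out.println(raw.size()); // 2

        // Lowercased first (the par_event_type approach): one partition.
        Set<String> normalized = values.stream()
                .map(v -> partitionDir("par_event_type", v.toLowerCase()))
                .collect(Collectors.toSet());
        System.out.println(normalized.size()); // 1
    }
}
```

The same reasoning explains why dropping one of the two values also works: with only one spelling left, there is only one directory name to create.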

Upvotes: 1
