Reputation: 7280
I am trying to write out a dataframe in hiveContext(for orc format) with a partition key:
df.write().partitionBy("event_type").mode(SaveMode.Overwrite).orc("/path");
However the column on which I am trying to partition has case sensitive values and this is throwing an error while writing:
Caused by: java.io.IOException: File already exists: file:/path/_temporary/0/_temporary/attempt_201607262359_0001_m_000000_0/event_type=searchFired/part-r-00000-57167cfc-a9db-41c6-91d8-708c4f7c572c.orc
event_type
column has both searchFired
and SearchFired
as values. However if I remove one of them from the dataframe then I am able to write successfully. How do I solve this?
Upvotes: 3
Views: 1522
Reputation: 13538
It is generally not a good idea to rely on case differences in file systems.
The solution is to combine values that differ by case into the same partition using something like (using the Scala DSL):
df
.withColumn("par_event_type", expr("lower(event_type)"))
.write
.partitionBy("par_event_type")
.mode(SaveMode.Overwrite)
.orc("/path")
This adds an extra column for partitioning. If that causes problems, you can use drop
to remove it when you read the data.
Upvotes: 1