shengshan zhang

Reputation: 538

Saving a Spark DataFrame as a partitioned table is very slow

df.write.partitionBy("par").format("orc").saveAsTable("mytable")

Hello everybody, when I save a Spark DataFrame as a partitioned Hive table, the process is very slow. Does anybody know why? Are there any parameters that should be tuned?

Upvotes: 3

Views: 5183

Answers (1)

Raphael Roth

Reputation: 27373

I guess the problem is that the DataFrame partitions are not "aligned" with the Hive partitions. Each DataFrame partition contains some data for many of the Hive partitions, so every write task creates its own file in each Hive partition it touches. The result is many small files per Hive partition, which is likely what makes the write slow.

Try to repartition the dataframe first on the same column:

df.repartition("par").write.partitionBy("par").format("orc").saveAsTable("mytable")
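
To see why the repartition helps, here is a rough pure-Python model of the file count, not Spark code. It assumes each write task emits one file per distinct value of "par" it holds, and it stands in a CRC32 hash for Spark's own hash partitioning. The task count, key count, and the helper names (count_output_files, split_round_robin, split_by_key) are made up for illustration.

```python
import zlib

NUM_TASKS = 7    # hypothetical number of DataFrame partitions (write tasks)
NUM_KEYS = 10    # hypothetical number of distinct values in column "par"
NUM_ROWS = 1000

# Each row is represented only by its "par" value.
rows = [i % NUM_KEYS for i in range(NUM_ROWS)]


def count_output_files(tasks):
    """Assumed model: a task writes one file per distinct 'par' value it holds."""
    return sum(len(set(task)) for task in tasks)


def split_round_robin(rows, num_tasks):
    """Rows spread over tasks with no regard to 'par' (the unaligned case)."""
    tasks = [[] for _ in range(num_tasks)]
    for i, par in enumerate(rows):
        tasks[i % num_tasks].append(par)
    return tasks


def split_by_key(rows, num_tasks):
    """Rows routed by a hash of 'par', so each value lands in exactly one task."""
    tasks = [[] for _ in range(num_tasks)]
    for par in rows:
        # CRC32 is a stand-in here, not the hash function Spark uses.
        tasks[zlib.crc32(str(par).encode()) % num_tasks].append(par)
    return tasks


unaligned = split_round_robin(rows, NUM_TASKS)
aligned = split_by_key(rows, NUM_TASKS)

print("files without repartition:", count_output_files(unaligned))
print("files with repartition on par:", count_output_files(aligned))
```

In this toy model the unaligned write produces 7 x 10 = 70 files, while the key-aligned write produces 10, one per Hive partition. One caveat: if "par" is heavily skewed, repartitioning on it puts most of the data into a few tasks, so the write may be bottlenecked there instead.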

Upvotes: 1
