Reputation: 3173
I am using Spark version 1.4.1. I am trying to load a partitioned Hive table into a DataFrame, where the Hive table is partitioned by year_week number; in one scenario I might have 104 partitions.
However, I can see that the DataFrame is being loaded with the data spread across 200 partitions, and I understand that this is because spark.sql.shuffle.partitions is set to 200 by default.
I would like to know whether there is a good way to load my Hive table into a Spark DataFrame with 104 partitions, making sure that the DataFrame is partitioned by year_week number at load time itself.
The reason for this expectation is that I will be doing a few joins with high-volume tables, all of which are partitioned by year_week number. Having the DataFrame partitioned by year_week number from the moment it is loaded would save me a lot of time that would otherwise be spent re-partitioning by year_week number.
Please let me know if you have any suggestions.
Thanks.
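For reference, here is a minimal sketch of the kind of load I am describing. The names (my_db.my_table, year_week) are placeholders for my actual table, and since Spark 1.4 only supports repartitioning a DataFrame by a partition count (not by column), the best I have found so far is lowering spark.sql.shuffle.partitions and repartitioning by number:

```scala
import org.apache.spark.sql.hive.HiveContext

// sc is the existing SparkContext (e.g. in spark-shell);
// my_db.my_table and year_week are placeholder names.
val hiveContext = new HiveContext(sc)

// Lower the shuffle parallelism so joins/aggregations produce 104
// partitions instead of the default 200.
hiveContext.setConf("spark.sql.shuffle.partitions", "104")

// Load the partitioned Hive table; the initial partition count here is
// driven by the underlying file splits, not by spark.sql.shuffle.partitions.
val df = hiveContext.table("my_db.my_table")

// Spark 1.4 can only repartition a DataFrame by count, not by column,
// so this yields 104 partitions but not a hash layout on year_week.
val df104 = df.repartition(104)
println(df104.rdd.partitions.length) // 104
```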
Upvotes: 1
Views: 5199
Reputation: 1992
Use hiveContext.sql("SELECT * FROM tableName WHERE pt='2012.07.28.10'"), where pt is the partition key; in your case it will be year_week together with the corresponding partition value.
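As a sketch of how this looks with your schema (table name, year_week column, and the literal week value are all placeholders to adapt), the WHERE clause on the partition column lets Hive prune the scan down to just that partition:

```scala
import org.apache.spark.sql.hive.HiveContext

// sc is the existing SparkContext; my_db.my_table, year_week and
// '2015_32' are placeholder names/values for illustration only.
val hiveContext = new HiveContext(sc)

// Filtering on the partition column prunes to a single Hive partition.
val oneWeek = hiveContext.sql(
  "SELECT * FROM my_db.my_table WHERE year_week = '2015_32'")

oneWeek.show(5)
```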
Upvotes: 0