suresiva

Reputation: 3173

Load hive partitioned table to Spark Dataframe

I am using Spark 1.4.1. I am trying to load a partitioned Hive table into a DataFrame. The Hive table is partitioned by year_week number, and in one scenario I might have 104 partitions.

However, I can see that the DataFrame is being loaded into 200 partitions, and I understand this is because spark.sql.shuffle.partitions is set to 200 by default.
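For reference, this is how I am checking and overriding that setting (assuming an existing SparkContext named sc; the names here are just from my shell session):

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)

    // prints the current shuffle partition count, "200" unless overridden
    println(hiveContext.getConf("spark.sql.shuffle.partitions", "200"))

    // lower it to match my 104 year_week partitions before any joins run
    hiveContext.setConf("spark.sql.shuffle.partitions", "104")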

I would like to know if there is a good way to load my Hive table into a Spark DataFrame with 104 partitions, making sure the DataFrame is partitioned by year_week number at load time itself.

The reason for this is that I will be doing a few joins with huge tables, all of which are partitioned by year_week number. Having the DataFrame partitioned by year_week and loaded accordingly would save me a lot of time re-partitioning it by year_week number later.
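The best I have found so far in 1.4.1 is to lower the shuffle partition count before the joins, since this version has no way to repartition a DataFrame by a column expression (the table names below are just placeholders):

    // assumes the hiveContext from above; "sales" and "inventory" are hypothetical
    // Hive tables, both partitioned by year_week
    hiveContext.setConf("spark.sql.shuffle.partitions", "104")

    val sales     = hiveContext.table("sales")
    val inventory = hiveContext.table("inventory")

    // the join shuffles both sides on year_week into 104 output partitions
    val joined = sales.join(inventory, sales("year_week") === inventory("year_week"))

Is there a better approach than this?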

Please let me know if you have any suggestions.

Thanks.

Upvotes: 1

Views: 5199

Answers (1)

Abhishek Anand

Reputation: 1992

Use hiveContext.sql("SELECT * FROM tableName WHERE pt = '2012.07.28.10'")

Here pt is the partition key; in your case it will be year_week, together with the corresponding value.
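For example, with your year_week column it could look like this (the table name and week value are placeholders); the predicate on the partition column lets Hive prune the scan to that single partition:

    val oneWeek = hiveContext.sql("SELECT * FROM my_table WHERE year_week = '201530'")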

Upvotes: 0
