Reputation: 3173
I am using Spark version 1.4.1. I am trying to load a partitioned Hive table into a DataFrame, where the Hive table is partitioned by year_week number; in one scenario I might have 104 partitions.
However, I can see that the DataFrame is being loaded with the data spread across 200 partitions, and I understand that this is because spark.sql.shuffle.partitions is set to 200 by default.
I would like to know whether there is a good way to load my Hive table into a Spark DataFrame with 104 partitions, making sure that the DataFrame is partitioned by year_week number at load time itself.
The reason for this expectation is that I will be doing a few joins with high-volume tables, all of which are partitioned by year_week number. Having the DataFrame partitioned by year_week number from the moment it is loaded would save me a lot of time that would otherwise be spent re-partitioning by year_week number.
Please let me know if you have any suggestions.
Thanks.
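For reference, here is a minimal sketch of the kind of load I am describing. The names (my_db.my_table, year_week) are placeholders for my actual table, and since Spark 1.4 only supports repartitioning a DataFrame by a partition count (not by column), the best I have found so far is lowering spark.sql.shuffle.partitions and repartitioning by number:

```scala
import org.apache.spark.sql.hive.HiveContext

// sc is the existing SparkContext (e.g. in spark-shell);
// my_db.my_table and year_week are placeholder names.
val hiveContext = new HiveContext(sc)

// Lower the shuffle parallelism so joins/aggregations produce 104
// partitions instead of the default 200.
hiveContext.setConf("spark.sql.shuffle.partitions", "104")

// Load the partitioned Hive table; the initial partition count here is
// driven by the underlying file splits, not by spark.sql.shuffle.partitions.
val df = hiveContext.table("my_db.my_table")

// Spark 1.4 can only repartition a DataFrame by count, not by column,
// so this yields 104 partitions but not a hash layout on year_week.
val df104 = df.repartition(104)
println(df104.rdd.partitions.length) // 104
```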
Upvotes: 1
Views: 5199
Reputation: 1992
Use hiveContext.sql("SELECT * FROM tableName WHERE pt='2012.07.28.10'"), where pt is the partition key; in your case it will be year_week together with the corresponding partition value.
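As a sketch of how this looks with your schema (table name, year_week column, and the literal week value are all placeholders to adapt), the WHERE clause on the partition column lets Hive prune the scan down to just that partition:

```scala
import org.apache.spark.sql.hive.HiveContext

// sc is the existing SparkContext; my_db.my_table, year_week and
// '2015_32' are placeholder names/values for illustration only.
val hiveContext = new HiveContext(sc)

// Filtering on the partition column prunes to a single Hive partition.
val oneWeek = hiveContext.sql(
  "SELECT * FROM my_db.my_table WHERE year_week = '2015_32'")

oneWeek.show(5)
```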
Upvotes: 0