Reputation: 339
I am confused about how Spark creates partitions in a DataFrame. Here is the list of steps and the partition counts:
i_df = sqlContext.read.json("json files")  # num partitions returned is 4, total records 7000
p_df = sqlContext.read.format("csv").Other options  # num partitions returned is 4, total records 120k
j_df = i_df.join(p_df, i_df.productId == p_df.product_id)  # total records 7000, but num partitions is 200
The first two DataFrames have 4 partitions each, but as soon as I join them the result shows 200 partitions. I was expecting the join to produce 4 partitions, so why is it showing 200?
I am running locally with conf.setIfMissing("spark.master", "local[4]")
Upvotes: 0
Views: 479
Reputation: 28744
200 is the default number of shuffle partitions. Any operation that requires a shuffle (joins, aggregations, etc.) repartitions its output into that many partitions, regardless of the inputs' partition counts. You can change it by setting spark.sql.shuffle.partitions.
Upvotes: 4