Aravind Chamakura
Aravind Chamakura

Reputation: 339

Spark dataframe partition count

I am confused about how spark creates partitions in spark dataframe. Here is the list of steps and the partition size

i_df = sqlContext.read.json("json files")  // num partitions returned is 4, total records 7000
p_df = sqlContext.read.format("csv").Other options   // num partitions returned is 4 , total records: 120k
j_df = i_df.join(p_df, i_df.productId == p_df.product_id) // total records 7000, but num of partitions is 200

first two dataframes have 4 partitions, but as soon as i join them it shows 200 partitions. I was expecting that it will make 4 partitions after joining, but why is it showing 200.

I am running it on local with conf.setIfMissing("spark.master", "local[4]")

Upvotes: 0

Views: 479

Answers (1)

zjffdu
zjffdu

Reputation: 28744

The 200 is the default shuffle partition size. you can change it by setting spark.sql.shuffle.partitions

Upvotes: 4

Related Questions