hasan.alkhatib

Reputation: 1559

What is the difference between Spark partitions and worker cores?

I used a standalone Spark cluster to process several files. When I executed the driver, the data was processed on each worker using its cores.

Now I've read about partitions, but I don't understand whether they are different from worker cores or not.

Is there a difference between setting the number of cores and the number of partitions?

Upvotes: 5

Views: 8183

Answers (2)

rakesh

Reputation: 2051

Simplistic view: Partition vs Number of Cores

When you invoke an action on an RDD:

  • A "job" is created for it, so a job is the unit of work submitted to Spark.
  • Jobs are divided into "stages" at shuffle boundaries.
  • Each stage is further divided into tasks based on the number of partitions of the RDD, so a task is the smallest unit of work in Spark.
  • How many of these tasks can execute simultaneously depends on the number of cores available (see the sketch below).
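
A minimal sketch of that hierarchy (Scala, run in spark-shell where a SparkContext `sc` already exists; the input path and partition count are made up):

    // The action at the end submits one job; the shuffle in reduceByKey splits
    // it into two stages; each stage runs one task per partition.
    val rdd = sc.textFile("hdfs:///some/input", minPartitions = 8)   // hypothetical path

    val counts = rdd
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)              // shuffle boundary -> second stage

    counts.count()                     // action -> submits a job
    println(counts.getNumPartitions)   // tasks per stage == number of partitions

Only as many of those tasks run at the same moment as there are cores assigned to the application; the rest wait in the queue.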

Upvotes: 14

Tim

Reputation: 3725

A partition (or task) refers to a unit of work. If you have a 200G Hadoop file loaded as an RDD and chunked into 128M blocks (Spark's default), you get roughly 1600 partitions in that RDD. The number of cores determines how many of those partitions can be processed at any one time; parallelism is capped at the number of partitions/tasks, so at most ~1600 tasks for this RDD could ever execute in parallel.
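
A minimal sketch of the same idea (Scala in spark-shell; the path is hypothetical, the 200G size and 128M block size are from the answer above):

    val rdd = sc.textFile("hdfs:///data/200g-file")   // hypothetical 200G input
    println(rdd.getNumPartitions)                     // ~1600 partitions at 128M per block

    // On a standalone cluster, the cores given to the application cap how many
    // of those ~1600 tasks run concurrently, e.g.:
    //   spark-submit --total-executor-cores 16 ...
    // -> at most 16 tasks execute at any one time.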

Upvotes: 4
