sercasti

Reputation: 590

Why does a single vanilla DataFrame.count() cause 2 jobs to be executed by PySpark?

I'm trying to understand how Spark transforms the logical execution plan into a physical execution plan.

I do 2 things:

  1. read a CSV file
  2. count over the DataFrame (sketched below)
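
A minimal sketch of those two steps (the file name data.csv and the reader options are placeholders, not the exact setup from the screenshots):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("count-jobs").getOrCreate()

# With header=True and/or inferSchema=True, Spark reads (part of) the
# file eagerly to determine column names and types, so jobs can be
# triggered before any action is even called.
df = spark.read.csv("data.csv", header=True, inferSchema=True)

# Prints the parsed, analyzed, and optimized logical plans plus the
# physical plan that will actually be scheduled.
df.explain(True)

# count() runs as a job with two stages: per-partition partial counts,
# then a final aggregation of those partial results.
print(df.count())
```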

So I was expecting only 2 jobs to be executed in the DAG.

Why is this creating 3 jobs in total?

[Spark UI screenshot showing 3 jobs]

And why did it need 3 different stages for this?

[Spark UI screenshot showing 3 stages]


Answers (1)

sercasti

Reputation: 590

I even went as far as removing the header from the file and disabling inferSchema, but there are still 3 jobs:

[Spark UI screenshot still showing 3 jobs]
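
A sketch of that attempt (the path is a placeholder, and the explicit-schema variant at the end is a hypothetical follow-up, not something shown in the screenshot):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

# No header row and no schema inference. When no explicit schema is
# supplied, Spark still scans the start of the file to work out how
# many columns it has, which can surface as an extra job in the UI.
df = spark.read.csv("data.csv", header=False, inferSchema=False)
df.count()

# Hypothetical explicit schema; supplying one typically avoids that
# eager metadata scan.
schema = StructType([StructField("c0", StringType(), True)])
df2 = spark.read.csv("data.csv", schema=schema)
df2.count()
```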

