sercasti

Reputation: 590

Why does a single vanilla DataFrame.count() cause 2 jobs to be executed by PySpark?

I'm trying to understand how Spark transforms the logical execution plan into a physical execution plan.

I do 2 things:

  1. read a CSV file
  2. count over the DataFrame (sketched below)
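
A minimal sketch of those two steps (the file name data.csv and the reader options are placeholders, not the exact setup from the screenshots):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("count-jobs").getOrCreate()

# With header=True and/or inferSchema=True, Spark reads (part of) the
# file eagerly to determine column names and types, so jobs can be
# triggered before any action is even called.
df = spark.read.csv("data.csv", header=True, inferSchema=True)

# Prints the parsed, analyzed, and optimized logical plans plus the
# physical plan that will actually be scheduled.
df.explain(True)

# count() runs as a job with two stages: per-partition partial counts,
# then a final aggregation of those partial results.
print(df.count())
```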

So I was expecting only 2 jobs to be executed in the DAG.

Why is this creating 3 jobs in total?

[Spark UI screenshot showing 3 jobs]

And why did it need 3 different stages for this?

[Spark UI screenshot showing 3 stages]


Answers (1)

sercasti

Reputation: 590

I even went as far as removing the header from the file and disabling inferSchema, but there are still 3 jobs:

[Spark UI screenshot still showing 3 jobs]
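
A sketch of that attempt (the path is a placeholder, and the explicit-schema variant at the end is a hypothetical follow-up, not something shown in the screenshot):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

# No header row and no schema inference. When no explicit schema is
# supplied, Spark still scans the start of the file to work out how
# many columns it has, which can surface as an extra job in the UI.
df = spark.read.csv("data.csv", header=False, inferSchema=False)
df.count()

# Hypothetical explicit schema; supplying one typically avoids that
# eager metadata scan.
schema = StructType([StructField("c0", StringType(), True)])
df2 = spark.read.csv("data.csv", schema=schema)
df2.count()
```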

