6:[["$","$Le",null,{}],["$","div",null,{"className":"min-h-screen bg-gray-100 p-6","children":[["$","$Lf",null,{}],["$","script",null,{"type":"application/ld+json","dangerouslySetInnerHTML":{"__html":"{\"@context\":\"https://schema.org\",\"@type\":\"QAPage\",\"mainEntity\":{\"@type\":\"Question\",\"name\":\"How to improve this Spark pipeline?\",\"text\":\"

Suppose I am joining a few Spark data frames like this:

\\n\\n

abcd = a.join(b, 'bid', 'inner')\\\\\\n        .join(c, 'cid', 'inner')\\\\\\n        .join(d, 'did', 'left')\\\\\\n        .distinct() \\nabcd.head() # takes 5-7 min.\\n

\\n\\n

The head invocation triggers the pipeline execution, which takes 5-7 min. Does it have anything to do with those joins? How would you make the pipeline faster?

\\n\",\"author\":{\"@type\":\"Person\",\"name\":\"Michael\"},\"upvoteCount\":1,\"answerCount\":1,\"acceptedAnswer\":null}}"}}],["$","div",null,{"className":"bg-white shadow-md rounded-lg p-6 mb-6 relative","children":[["$","div",null,{"className":"absolute top-4 right-4 flex flex-wrap space-x-2","children":[["$","span","performance",{"className":"bg-blue-600 text-white text-sm px-3 py-1 rounded-full","children":["$","$L10",null,{"href":"/discussion/tag/performance/1","children":"performance"}]}],["$","span","apache-spark",{"className":"bg-blue-600 text-white text-sm px-3 py-1 rounded-full","children":["$","$L10",null,{"href":"/discussion/tag/apache-spark/1","children":"apache-spark"}]}],["$","span","pyspark",{"className":"bg-blue-600 text-white text-sm px-3 py-1 rounded-full","children":["$","$L10",null,{"href":"/discussion/tag/pyspark/1","children":"pyspark"}]}]]}],["$","div",null,{"className":"flex items-center mb-4","children":[["$","img",null,{"src":"https://www.gravatar.com/avatar/072e8612042bbdedfcdd3cf86cdfa24a?s=256&d=identicon&r=PG&f=y&so-version=2","alt":"Michael","className":"w-16 h-16 rounded-full border"}],["$","div",null,{"className":"ml-4","children":[["$","a",null,{"href":"https://stackoverflow.com/users/521070/michael","target":"_blank","rel":"noopener noreferrer","className":"text-lg font-semibold text-blue-600 hover:underline","children":"Michael"}],["$","p",null,{"className":"text-sm text-gray-500","children":["Reputation: ",42100]}]]}]]}],["$","h1",null,{"className":"text-2xl font-bold text-gray-800 mb-4","children":"How to improve this Spark pipeline?"}],["$","p",null,{"className":"text-gray-700 mt-4","dangerouslySetInnerHTML":{"__html":"

Suppose I am joining a few Spark data frames like this:

\n\n

abcd = a.join(b, 'bid', 'inner')\\\n        .join(c, 'cid', 'inner')\\\n        .join(d, 'did', 'left')\\\n        .distinct() \nabcd.head() # takes 5-7 min.\n

\n\n

The head invocation triggers the pipeline execution, which takes 5-7 min. Does it have anything to do with those joins? How would you make the pipeline faster?

\n"}}],["$","div",null,{"className":"text-gray-600 text-sm mt-4","children":[["$","p",null,{"children":["Upvotes: ",1]}],["$","p",null,{"children":["Views: ",56]}]]}]]}],["$","div",null,{"className":"container mx-auto","children":[["$","h2",null,{"className":"text-2xl font-semibold text-gray-800 mb-6","children":["Answers (",1,")"]}],[["$","div","50334813",{"className":"bg-white shadow-md rounded-lg p-6 mb-6","children":[["$","div",null,{"className":"flex items-center mb-4","children":[["$","img",null,{"src":"https://www.gravatar.com/avatar/90743ad0e0e0ce4454d65b2ad2133467?s=256&d=identicon&r=PG&f=y&so-version=2","alt":"vvg","className":"w-12 h-12 rounded-full border"}],["$","div",null,{"className":"ml-4","children":[["$","a",null,{"href":"https://stackoverflow.com/users/3641023/vvg","target":"_blank","rel":"noopener noreferrer","className":"text-lg font-semibold text-blue-600 hover:underline","children":"vvg"}],["$","p",null,{"className":"text-sm text-gray-500","children":["Reputation: ",6385]}]]}]]}],["$","p",null,{"className":"text-gray-700 mb-4","dangerouslySetInnerHTML":{"__html":"

head() returns just one record.\nIf you only need the first record, you don't need distinct().\nDropping it might save you an expensive shuffle.

\n\n

However, because the result of the joins above is not sorted, there is no guarantee which record will be returned.

\n"}}],["$","div",null,{"className":"text-gray-600 text-sm","children":["$","p",null,{"children":["Upvotes: ",1]}]}]]}]]]}],["$","div",null,{"className":"bg-white shadow-md rounded-lg p-6 mt-6","children":[["$","h2",null,{"className":"text-2xl font-semibold text-gray-800 mb-4","children":"Related Questions"}],["$","ul",null,{"className":"list-disc list-inside","children":[["$","li","61284118",{"className":"mb-2","children":["$","$L10",null,{"href":"/discussion/solution/61284118","className":"text-blue-600 hover:underline","children":"Optimising Spark read and write performance"}]}],["$","li","27757117",{"className":"mb-2","children":["$","$L10",null,{"href":"/discussion/solution/27757117","className":"text-blue-600 hover:underline","children":"Spark Python Performance Tuning"}]}],["$","li","72640110",{"className":"mb-2","children":["$","$L10",null,{"href":"/discussion/solution/72640110","className":"text-blue-600 hover:underline","children":"pyspark performance and processing time"}]}],["$","li","71203271",{"className":"mb-2","children":["$","$L10",null,{"href":"/discussion/solution/71203271","className":"text-blue-600 hover:underline","children":"Why Spark processing takes longer?"}]}],["$","li","71069103",{"className":"mb-2","children":["$","$L10",null,{"href":"/discussion/solution/71069103","className":"text-blue-600 hover:underline","children":"Pyspark Pipeline Performance"}]}],["$","li","63638502",{"className":"mb-2","children":["$","$L10",null,{"href":"/discussion/solution/63638502","className":"text-blue-600 hover:underline","children":"how to improve performance in pyspark joins"}]}],["$","li","49947159",{"className":"mb-2","children":["$","$L10",null,{"href":"/discussion/solution/49947159","className":"text-blue-600 hover:underline","children":"Optimize Pyspark code to run fast"}]}],["$","li","40502185",{"className":"mb-2","children":["$","$L10",null,{"href":"/discussion/solution/40502185","className":"text-blue-600 hover:underline","children":"Spark 
program takes a really long time to complete execution"}]}],["$","li","40089822",{"className":"mb-2","children":["$","$L10",null,{"href":"/discussion/solution/40089822","className":"text-blue-600 hover:underline","children":"optimization for processing big data in pyspark"}]}],["$","li","31693723",{"className":"mb-2","children":["$","$L10",null,{"href":"/discussion/solution/31693723","className":"text-blue-600 hover:underline","children":"Spark query running very slow"}]}]]}]]}]]}],["$","$L11",null,{}],["$","$L12",null,{}],["$","$L13",null,{}],["$","$L14",null,{}],["$","$L15",null,{}]]
