Different outcomes from seemingly equivalent implementations of PySpark transformations

Question

I have a set of Spark DataFrame transformations that fails with an out-of-memory error and produces a messed-up SQL query plan, while a different implementation of the same logic runs successfully.

%python
import pandas as pd
diction = {
    'key': [1,2,3,4,5,6],
    'f1' : [1,0,1,0,1,0],
    'f2' : [0,1,0,1,0,1],
    'f3' : [1,0,1,0,1,0],
    'f4' : [0,1,0,1,0,1]}
bil = pd.DataFrame(diction)
# successful logic: tempdf is always taken from the original cached frame (zdf)
df = spark.createDataFrame(bil)
df = df.cache()
zdf = df
for i in [1,2,3]:
  tempdf = zdf.select(['key'])
  df = df.join(tempdf,on=['key'],how='left')
df.show()
# failed logic: tempdf is taken from df, which already contains the earlier joins
df = spark.createDataFrame(bil)
df = df.cache()
for i in [1,2,3]:
  tempdf = df.select(['key'])
  df = df.join(tempdf,on=['key'],how='left')
df.show()

Logically, there should not be such a computational difference (more than double the time and memory used). Can anyone help me understand this?
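One way to reason about the difference is to count how often each version refers back to the base DataFrame. The sketch below is a toy model in plain Python, not Spark's actual planner, and the function names are hypothetical. It assumes that only the original DataFrame is cached, so every reference to the joined `df` repeats its whole lineage.

```python
# Toy model of plan lineage (an illustration, NOT Spark's planner).
# It counts how many references to the base DataFrame end up in the plan
# when none of the intermediate join results are cached.

def scans_fixed_source(iterations):
    """tempdf always comes from the original frame (zdf)."""
    plan = 1   # references needed to build df
    base = 1   # references needed to build zdf; this never grows
    for _ in range(iterations):
        plan = plan + base   # each join adds one more reference
    return plan

def scans_self_source(iterations):
    """tempdf comes from the current df, which already holds prior joins."""
    plan = 1
    for _ in range(iterations):
        plan = plan + plan   # each join repeats the whole plan of df
    return plan

for n in (1, 2, 3, 10):
    print(n, scans_fixed_source(n), scans_self_source(n))
```

Under this model the first version grows linearly with the number of iterations and the second doubles on every iteration, so the gap widens quickly as the loop gets longer.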

[Figure: DAG of the successful logic]

[Figure: DAG of the failed logic]
