Vignesh D
Vignesh D

Reputation: 219

Different outcome from seemingly equivalent implementation of PySpark transformations

I have a set of spark dataframe transforms which gives an out of memory error and has a messed up sql query plan while a different implemetation runs successfully.

%python
import pandas as pd
diction = {
    'key': [1,2,3,4,5,6],
    'f1' : [1,0,1,0,1,0],
    'f2' : [0,1,0,1,0,1],
    'f3' : [1,0,1,0,1,0],
    'f4' : [0,1,0,1,0,1]}
bil = pd.DataFrame(diction)
# successfull logic
df = spark.createDataFrame(bil)
df = df.cache()
zdf = df
for i in [1,2,3]:
  tempdf = zdf.select(['key'])
  df = df.join(tempdf,on=['key'],how='left')
df.show()
# failed logic
df = spark.createDataFrame(bil)
df = df.cache()
for i in [1,2,3]:
  tempdf = df.select(['key'])
  df = df.join(tempdf,on=['key'],how='left')
df.show()

Logically thinking there must not be such a computational difference (more than double the time and memory used). Can anyone help me understand this ?

DAG of successful logic: successful

DAG of failure logic: failure

Upvotes: 0

Views: 123

Answers (1)

ScootCork
ScootCork

Reputation: 3676

I'm not sure what your use case is for this code, however the two pieces of code are not logically the same. In the second version you are joining the result of the previous iteration to itself three times. In the first version you are joining a 'copy' of the original df three times. If your key column is not unique, the second piece of code will 'explode' your dataframe more than the first.

To make this more clear we can make a simple example below where we have a non-unique key value. Taking your second example:

df = spark.createDataFrame([(1,'a'), (1,'b'), (3,'c')], ['key','val'])
for i in [1,2,3]:
  tempdf = df.select(['key'])
  df = df.join(tempdf,on=['key'],how='left')
df.count()

>>> 257

And your first piece of code:

df = spark.createDataFrame([(1,'a'), (1,'b'), (3,'c')], ['key','val'])
zdf = df
for i in [1,2,3]:
  tempdf = zdf.select(['key'])
  df = df.join(tempdf,on=['key'],how='left')
df.count()

>>> 17

Upvotes: 1

Related Questions