Reputation: 219
I have a set of spark dataframe transforms which gives an out of memory error and has a messed up sql query plan while a different implemetation runs successfully.
%python
import pandas as pd
diction = {
'key': [1,2,3,4,5,6],
'f1' : [1,0,1,0,1,0],
'f2' : [0,1,0,1,0,1],
'f3' : [1,0,1,0,1,0],
'f4' : [0,1,0,1,0,1]}
bil = pd.DataFrame(diction)
# successfull logic
df = spark.createDataFrame(bil)
df = df.cache()
zdf = df
for i in [1,2,3]:
tempdf = zdf.select(['key'])
df = df.join(tempdf,on=['key'],how='left')
df.show()
# failed logic
df = spark.createDataFrame(bil)
df = df.cache()
for i in [1,2,3]:
tempdf = df.select(['key'])
df = df.join(tempdf,on=['key'],how='left')
df.show()
Logically thinking there must not be such a computational difference (more than double the time and memory used). Can anyone help me understand this ?
Upvotes: 0
Views: 123
Reputation: 3676
I'm not sure what your use case is for this code, however the two pieces of code are not logically the same. In the second version you are joining the result of the previous iteration to itself three times. In the first version you are joining a 'copy' of the original df three times. If your key column is not unique, the second piece of code will 'explode' your dataframe more than the first.
To make this more clear we can make a simple example below where we have a non-unique key value. Taking your second example:
df = spark.createDataFrame([(1,'a'), (1,'b'), (3,'c')], ['key','val'])
for i in [1,2,3]:
tempdf = df.select(['key'])
df = df.join(tempdf,on=['key'],how='left')
df.count()
>>> 257
And your first piece of code:
df = spark.createDataFrame([(1,'a'), (1,'b'), (3,'c')], ['key','val'])
zdf = df
for i in [1,2,3]:
tempdf = zdf.select(['key'])
df = df.join(tempdf,on=['key'],how='left')
df.count()
>>> 17
Upvotes: 1