SB07

Reputation: 76

PySpark join failing on Dataproc

I am trying to run a Python PySpark script on a Dataproc cluster, but it fails with the error below:

File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 815, in join 
if isinstance(on[0], basestring): 
IndexError: list index out of range

The syntax I am using in my code is:

df1.join(df2, col1)

Any ideas?

Upvotes: 0

Views: 92

Answers (1)

Dennis Huo

Reputation: 10677

Looking at the code, on is the "col1" argument you're passing in, and the code in Spark assumes that if on is not None it definitely has at least one element. Is it possible that you're accidentally passing in an empty list for col1? Perhaps you can print out col1 before calling join to make sure.
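One way to catch this early is to validate the join keys before handing them to join, so an accidentally empty list fails with a clear message instead of Spark's IndexError. This is a hypothetical helper, not part of PySpark; validated_join_keys is an assumed name, and the df1.join call in the comment assumes your existing DataFrames:

```python
def validated_join_keys(on):
    """Return `on` as a non-empty list of column names, or raise.

    Hypothetical guard: mirrors the check suggested above, since
    DataFrame.join in this Spark version indexes on[0] without
    checking for an empty list.
    """
    if on is None:
        raise ValueError("join keys are None")
    # A single column name is allowed; normalize it to a list.
    keys = [on] if isinstance(on, str) else list(on)
    if not keys:
        raise ValueError("join keys are empty -- check how col1 is built")
    return keys

# In the real script you would then call:
#   df1.join(df2, validated_join_keys(col1))
print(validated_join_keys("id"))          # → ['id']
print(validated_join_keys(["id", "ts"]))  # → ['id', 'ts']
```

If col1 is built dynamically (e.g. from a filtered list of column names), this makes the failure point obvious instead of surfacing deep inside pyspark/sql/dataframe.py.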

Upvotes: 1
