Reputation: 687
I have an EMR cluster (emr-5.28.0) with Spark 2.4.4 and Python 2.7.16.
If I ssh to the cluster and execute pyspark like this:
pyspark --jars /home/hadoop/jar/spark-redshift_2.11-2.0.1.jar,/home/hadoop/jar/spark-avro_2.11-4.0.0.jar,/home/hadoop/jar/minimal-json-0.9.5.jar,/usr/share/aws/redshift/jdbc/RedshiftJDBC.jar --packages org.apache.spark:spark-avro_2.11:2.4.4
and execute this code:
url = "jdbc:redshift://my.cluster:5439/my_db?user=my_user&password=my_password"
query = "select * from schema.table where trunc(timestamp)='2019-09-10'"
df = sqlContext.read.format('com.databricks.spark.redshift') \
    .option("url", url) \
    .option("tempdir", "s3a://bucket/tmp_folder") \
    .option("query", query) \
    .option("aws_iam_role", "arn_iam_role") \
    .load()
Everything works fine and I can work with that df. But if I open a Zeppelin notebook on the same EMR cluster, with the same versions of everything, and execute a cell with:
%dep
z.load("/home/hadoop/jar/spark-redshift_2.11-2.0.1.jar")
z.load("/home/hadoop/jar/spark-avro_2.11-4.0.0.jar")
z.load("/home/hadoop/jar/minimal-json-0.9.5.jar")
z.load("/usr/share/aws/redshift/jdbc/RedshiftJDBC.jar")
z.load("org.apache.spark:spark-avro_2.11:2.4.4")
and in the next cell I execute the same piece of code (starting with %pyspark), when I try to do a df.count() I get the following error:
java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD
I've tried restarting the interpreter several times, and I've tried adding to the interpreter args the same --jars options that I use on the console when I ssh, but no luck.
Any ideas??
Upvotes: 0
Views: 679
Reputation: 8269
I think this is an issue with how z.load works (or rather, doesn't work) for PySpark queries.
Instead of loading your dependencies this way, go to Settings -> Interpreters, find the pyspark interpreter, add your dependencies there, and then restart the interpreter. This is the Zeppelin equivalent of --jars.
Here's the official docs link for this: https://zeppelin.apache.org/docs/0.6.2/manual/dependencymanagement.html
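For reference, here's a minimal sketch of what the Dependencies section of the interpreter settings would contain, assuming the same jar paths and Maven coordinate from your question (each entry goes in its own "artifact" field):

/home/hadoop/jar/spark-redshift_2.11-2.0.1.jar
/home/hadoop/jar/spark-avro_2.11-4.0.0.jar
/home/hadoop/jar/minimal-json-0.9.5.jar
/usr/share/aws/redshift/jdbc/RedshiftJDBC.jar
org.apache.spark:spark-avro_2.11:2.4.4

After saving and restarting the interpreter, the jars should be on the classpath of both the driver and the executors, which is consistent with the driver/executor classpath mismatch that this kind of ClassCastException usually indicates.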
I know that for Spark SQL z.deps doesn't work, so this may be the same issue.
Upvotes: 1