Reputation: 2011
I am trying to read data from a Redshift table into a Spark DataFrame and write that DataFrame into another Redshift table. I'm using the following .jars in spark-submit for this task.
Here is the command:
spark-submit --jars RedshiftJDBC41-1.2.12.1017.jar,minimal-json-0.9.4.jar,spark-avro_2.11-3.0.0.jar,spark-redshift_2.10-2.0.0.jar,aws-java-sdk-sqs-1.11.694.jar,aws-java-sdk-s3-1.11.694.jar,aws-java-sdk-core-1.11.694.jar --packages org.apache.hadoop:hadoop-aws:2.7.4 t.py
I tried changing the versions of all the jars, and the hadoop-aws version accordingly, as suggested in various Stack Overflow answers, with no luck.
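For reference, the traceback below points at line 21 of t.py, which is the tempdir option on the spark-redshift read. The code presumably looks roughly like the sketch below (the JDBC URL, credentials, and table names are placeholders, not the real ones):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("redshift_migration").getOrCreate()

# Read a Redshift table; spark-redshift unloads it to the S3 tempdir first.
df = spark.read \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:redshift://<host>:5439/<db>?user=<user>&password=<password>") \
    .option("dbtable", "source_table") \
    .option("tempdir", "s3a://AccessKey:AccessSecret@big-query-to-rs/rs_temp_data") \
    .load()

# Write the DataFrame to another Redshift table through the same tempdir.
df.write \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:redshift://<host>:5439/<db>?user=<user>&password=<password>") \
    .option("dbtable", "target_table") \
    .option("tempdir", "s3a://AccessKey:AccessSecret@big-query-to-rs/rs_temp_data") \
    .mode("error") \
    .save()

The load call fails with the following traceback: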
Traceback (most recent call last):
File "/home/ubuntu/trell-ds-framework/data_engineering/data_migration/t.py", line 21, in <module>
.option("tempdir", "s3a://AccessKey:AccessSecret@big-query-to-rs/rs_temp_data") \
File "/home/ubuntu/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 172, in load
File "/home/ubuntu/spark-2.3.0-bin-hadoop2.7/python/lib/py4j-0.10.6-src.zip/py4j/java_gateway.py", line 1160, in __call__
File "/home/ubuntu/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/home/ubuntu/spark-2.3.0-bin-hadoop2.7/python/lib/py4j-0.10.6-src.zip/py4j/protocol.py", line 320, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o37.load.
: java.lang.NoSuchMethodError: com.amazonaws.services.s3.transfer.TransferManager.<init>(Lcom/amazonaws/services/s3/AmazonS3;Ljava/util/concurrent/ThreadPoolExecutor;)V
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:287)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at com.databricks.spark.redshift.Utils$.assertThatFileSystemIsNotS3BlockFileSystem(Utils.scala:124)
at com.databricks.spark.redshift.RedshiftRelation.<init>(RedshiftRelation.scala:52)
at com.databricks.spark.redshift.DefaultSource.createRelation(DefaultSource.scala:49)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:340)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:164)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
2019-12-21 14:38:03 INFO SparkContext:54 - Invoking stop() from shutdown hook
2019-12-21 14:38:03 INFO AbstractConnector:318 - Stopped Spark@3c115b0a{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
2019-12-21 14:38:03 INFO SparkUI:54 - Stopped Spark web UI at http://ip-172-30-1-193.ap-south-1.compute.internal:4040
2019-12-21 14:38:03 INFO MapOutputTrackerMasterEndpoint:54 - MapOutputTrackerMasterEndpoint stopped!
Can anybody help me figure out what the issue could be? Is it a library issue with one of the .jars, with Hadoop, or something else?
Thanks.
Upvotes: 2
Views: 3165
Reputation: 13440
As the S3A troubleshooting docs say:
"Do not attempt to 'drop in' a newer version of the AWS SDK than that which the Hadoop version was built with. Whatever problem you have, changing the AWS SDK version will not fix things, only change the stack traces you see."
Stop using the 8-year-old Hadoop 2.7 binaries; upgrade to a Spark release built with the Hadoop 3.3.1 binaries and try again.
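For example, rather than listing individual aws-java-sdk-* jars by hand, you can let Maven resolve hadoop-aws together with the exact aws-java-sdk-bundle it was built against. A minimal sketch, assuming a Spark build that ships the Hadoop 3.3.1 binaries (check the hadoop-aws POM for the bundle version your Hadoop release actually declares):

from pyspark.sql import SparkSession

# spark.jars.packages pulls hadoop-aws plus its declared aws-java-sdk-bundle
# dependency from Maven, so the SDK and Hadoop versions stay in sync.
spark = SparkSession.builder \
    .appName("redshift_migration") \
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.1") \
    .getOrCreate()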
Upvotes: 0
Reputation: 5588
In my case, I was using the s3a connector and PySpark on SageMaker with the Hadoop 2.7.3 jars. To fix it, I needed to install the jars hadoop-aws-2.7.3.jar and aws-java-sdk-1.7.4.jar. This resolved my issue, instead of using the documented aws-java-sdk-bundle jar.
My code:
from pyspark.sql import SparkSession

# Point spark.jars at the matched hadoop-aws and aws-java-sdk jars.
builder = SparkSession.builder.appName('appName')
builder = builder.config('spark.jars', 'hadoop-aws-2.7.3.jar,aws-java-sdk-1.7.4.jar')
spark = builder.getOrCreate()
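A quick way to confirm the S3A connector now initializes cleanly (hypothetical bucket and key):

# With matching jar versions, this no longer fails with NoSuchMethodError on TransferManager.
spark.read.text("s3a://some-bucket/some-key.txt").show(5)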
Upvotes: 1
Reputation: 3914
I had the same issue while using spark-shell.
The issue was that the jar was loaded after all the other jars; I moved it to the beginning of the list and the issue was resolved.
Upvotes: 0
Reputation: 342
I ran into a similar kind of error a few days back.
I was able to resolve it by upgrading the 'aws-java-sdk-s3' jar to the latest version.
Upvotes: 1