Reputation: 93
I'm writing a Spark program that reads from and writes to Amazon S3. My problem is that it works when I run it in local mode (--master local[6]), but when I run it on the cluster (on other machines) I get an error with the credentials:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 1.0 failed 4 times, most recent failure: Lost task 2.3 in stage 1.0 (TID 33, mmdev02.stratio.com): com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chain
at com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:117)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3521)
at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031)
at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2596)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:384)
at org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:157)
at org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
at org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.<init>(SqlNewHadoopRDD.scala:155)
at org.apache.spark.rdd.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:120)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chain
My code is as follows:
import com.amazonaws.SDKGlobalConfiguration
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("BackupS3")
val sc = SparkContext.getOrCreate(conf)

// S3A credentials, endpoint and buffer directory on the context's Hadoop configuration
sc.hadoopConfiguration.set("fs.s3a.access.key", accessKeyId)
sc.hadoopConfiguration.set("fs.s3a.secret.key", secretKey)
sc.hadoopConfiguration.set("fs.s3a.endpoint", "s3-" + region + ".amazonaws.com")
sc.hadoopConfiguration.set("com.amazonaws.services.s3.enableV4", "true")
sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc.hadoopConfiguration.set("fs.s3a.buffer.dir", "/var/tmp/spark")

// Enable V4 signing in the AWS SDK (system properties of the current JVM)
System.setProperty(SDKGlobalConfiguration.ENABLE_S3_SIGV4_SYSTEM_PROPERTY, "true")
System.setProperty("com.amazonaws.services.s3.enableV4", "true")
I can write to Amazon S3 but I cannot read! I also had to pass some properties when doing spark-submit, because my region is Frankfurt and I had to enable V4 signing:
--conf spark.executor.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true
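For reference, a minimal sketch (not from the original program) of supplying the same executor option from code, assuming it is placed on the SparkConf before the SparkContext, and therefore the executors, are created:
import org.apache.spark.SparkConf

// Sketch only: equivalent to the --conf flag above. Executor JVMs read
// spark.executor.extraJavaOptions when they are launched, so this has to be
// set before the SparkContext is created; System.setProperty(...) only
// affects the driver JVM.
val conf = new SparkConf()
  .setAppName("BackupS3")
  .set("spark.executor.extraJavaOptions", "-Dcom.amazonaws.services.s3.enableV4=true")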
I tried passing the credentials the same way (via spark-submit --conf) too. If I put them in hdfs-site.xml on every machine, it works.
My question is: how can I do this from code? Why are the executors not getting the configuration I pass them from the code?
I'm using Spark 1.5.2, hadoop-aws 2.7.1 and aws-java-sdk 1.7.4.
Thanks
Upvotes: 0
Views: 3833
Reputation: 18770
If you set these properties explicitly in your code, the values will only be visible to the driver process. The executors will not have a chance to pick up those credentials.
If you set them in an actual config file like core-site.xml, they will propagate.
Your code would work in local mode because all operations are happening in a single process.
Why it works on a cluster with small files but not large ones (*): the code could also work on unpartitioned files, where read operations are performed in the driver and partitions are then broadcast to executors. On partitioned files, where executors read individual partitions, the credentials won't be set on the executors, so the read fails.
It's best to use standard mechanisms for passing credentials, or better yet, to use EC2 roles and IAM policies in your cluster, as EricJ's answer suggests. By default, if you do not provide credentials, EMRFS looks up temporary credentials via the EC2 instance metadata service.
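As one example of such a standard mechanism, a minimal sketch (assuming a Spark version that copies spark.hadoop.*-prefixed properties from the SparkConf into the Hadoop configuration on both the driver and the executors): attach the S3A settings to the SparkConf itself instead of mutating sc.hadoopConfiguration after the context exists.
import org.apache.spark.{SparkConf, SparkContext}

// Sketch, not the asker's exact code: "spark.hadoop."-prefixed keys travel
// with the SparkConf and are merged into the Hadoop configuration wherever it
// is built, so the executors see the same S3A settings as the driver.
// accessKeyId, secretKey and region are placeholders for the asker's values.
val conf = new SparkConf()
  .setAppName("BackupS3")
  .set("spark.hadoop.fs.s3a.access.key", accessKeyId)
  .set("spark.hadoop.fs.s3a.secret.key", secretKey)
  .set("spark.hadoop.fs.s3a.endpoint", "s3-" + region + ".amazonaws.com")
val sc = SparkContext.getOrCreate(conf)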
(*) I am still learning about this myself, and I may need to revise this answer as I learn more
Upvotes: 0
Reputation: 13430
Set the AWS_ environment variables (AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY) before submitting the work.
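A quick way to see where those variables end up, a diagnostic sketch that is not part of the original answer: ask each executor whether the variables are visible in its environment.
// Diagnostic sketch: the AWS SDK's default provider chain can read
// AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY from the environment, so this
// reports how many partitions were processed by a JVM that sees them.
val present = sc.parallelize(1 to 100, sc.defaultParallelism)
  .mapPartitions(_ => Iterator(sys.env.contains("AWS_ACCESS_KEY_ID") && sys.env.contains("AWS_SECRET_ACCESS_KEY")))
  .collect()
println(s"partitions that saw the AWS_ variables: ${present.count(identity)} of ${present.length}")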
Upvotes: 3