user1086102

Reputation: 33

How to set up Google Cloud Storage correctly for a Spark application using AWS Data Pipeline

I am setting up a cluster step to run a Spark application using AWS Data Pipeline. The job reads data from S3, processes it, and writes the results to Google Cloud Storage. For Google Cloud Storage I am using a service account with a key file. However, the write step fails because the key file is reported as NOT FOUND. I have tried several approaches and none of them work, yet the application runs fine when it is launched without Data Pipeline.

Here is what I tried:

Attempt 1:

google.cloud.auth.service.account.json.keyfile = "/home/hadoop/gs_test.json"

command-runner.jar,spark-submit,--master,yarn,--deploy-mode,client,--jars,/home/hadoop/appHelper.jar,--num-executors,5,--executor-cores,3,--executor-memory,6G,--name,MyApp,/home/hadoop/app.jar,s3://myBucket/app.conf

Attempt 2:

google.cloud.auth.service.account.json.keyfile = "/home/hadoop/gs_test.json"

command-runner.jar,spark-submit,--master,yarn,--deploy-mode,client,--jars,/home/hadoop/appHelper.jar,--num-executors,5,--executor-cores,3,--executor-memory,6G,--name,MyApp,--files,/home/hadoop/gs_test.json, /home/hadoop/app.jar,s3://myBucket/app.conf

Attempt 3:

google.cloud.auth.service.account.json.keyfile = "gs_test.json"

command-runner.jar,spark-submit,--master,yarn,--deploy-mode,client,--jars,/home/hadoop/appHelper.jar,--num-executors,5,--executor-cores,3,--executor-memory,6G,--name,MyApp,--files,/home/hadoop/gs_test.json#gs_test.json, /home/hadoop/app.jar,s3://myBucket/app.conf
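
For reference, my understanding is that the connector settings can also be pushed into the Hadoop configuration seen by the executors through --conf spark.hadoop.* options on spark-submit. A sketch of that variant (property names taken from the GCS connector documentation, and the key file path is assumed to exist at /home/hadoop/gs_test.json on every node; I have not confirmed this under Data Pipeline):

command-runner.jar,spark-submit,--master,yarn,--deploy-mode,client,--jars,/home/hadoop/appHelper.jar,--conf,spark.hadoop.google.cloud.auth.service.account.enable=true,--conf,spark.hadoop.google.cloud.auth.service.account.json.keyfile=/home/hadoop/gs_test.json,--num-executors,5,--executor-cores,3,--executor-memory,6G,--name,MyApp,/home/hadoop/app.jar,s3://myBucket/app.conf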

Here is the error:

java.io.FileNotFoundException: /home/hadoop/gs_test.p12 (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at com.google.api.client.googleapis.auth.oauth2.GoogleCredential$Builder.setServiceAccountPrivateKeyFromP12File(GoogleCredential.java:670)
at com.google.cloud.hadoop.util.CredentialFactory.getCredentialFromPrivateKeyServiceAccount(CredentialFactory.java:234)
at com.google.cloud.hadoop.util.CredentialConfiguration.getCredential(CredentialConfiguration.java:90)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.configure(GoogleHadoopFileSystemBase.java:1816)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.initialize(GoogleHadoopFileSystemBase.java:1003)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.initialize(GoogleHadoopFileSystemBase.java:966)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2717)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:93)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2751)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2733)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:377)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.<init>(FileOutputCommitter.java:113)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.<init>(FileOutputCommitter.java:88)
at org.apache.hadoop.mapreduce.lib.output.DirectFileOutputCommitter.<init>(DirectFileOutputCommitter.java:31)
at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.getOutputCommitter(FileOutputFormat.java:310)
at org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol.setupCommitter(SQLHadoopMapReduceCommitProtocol.scala:36)
at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.setupTask(HadoopMapReduceCommitProtocol.scala:146)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:246)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:191)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:190)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Any idea how to set up Google Cloud Storage correctly for a Spark application using AWS Data Pipeline? Thank you so much for your help.

Upvotes: 3

Views: 1074

Answers (1)

Alan Borsato

Reputation: 256

If I understood correctly: you want to use GCS (gs:// URLs) in your Spark jobs outside Dataproc.

In that case you will have to install the GCS connector, which makes the gs:// URL scheme available: https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/README.md

Installation and setup instructions are in the GitHub link above.
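
As a rough sketch of how the submit could look once the connector jar is on the cluster nodes (the jar name and paths below are placeholders, and the property names are taken from the connector documentation, so adjust them to the version you actually install):

spark-submit --master yarn --deploy-mode client \
  --jars /home/hadoop/gcs-connector-hadoop2-latest.jar,/home/hadoop/appHelper.jar \
  --conf spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem \
  --conf spark.hadoop.fs.AbstractFileSystem.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS \
  --conf spark.hadoop.google.cloud.auth.service.account.enable=true \
  --conf spark.hadoop.google.cloud.auth.service.account.json.keyfile=/home/hadoop/gs_test.json \
  /home/hadoop/app.jar s3://myBucket/app.conf

In a Data Pipeline step this would be the same arguments expressed in the command-runner.jar comma-separated form, with the key file present at that path on every node.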

Upvotes: 0
