user1086102

Reputation: 33

How to set up Google Cloud Storage correctly for a Spark application using AWS Data Pipeline

I am setting up a cluster step to run a Spark application using AWS Data Pipeline. The job reads data from S3, processes it, and writes the results to Google Cloud Storage. For Google Cloud Storage I am using a service account with a key file. However, the write step fails because the key file is reported as NOT FOUND. I have tried several approaches and none of them work, yet the application runs fine when it is launched without Data Pipeline.

Here is what I tried:

Attempt 1:

google.cloud.auth.service.account.json.keyfile = "/home/hadoop/gs_test.json"

command-runner.jar,spark-submit,--master,yarn,--deploy-mode,client,--jars,/home/hadoop/appHelper.jar,--num-executors,5,--executor-cores,3,--executor-memory,6G,--name,MyApp,/home/hadoop/app.jar,s3://myBucket/app.conf

Attempt 2:

google.cloud.auth.service.account.json.keyfile = "/home/hadoop/gs_test.json"

command-runner.jar,spark-submit,--master,yarn,--deploy-mode,client,--jars,/home/hadoop/appHelper.jar,--num-executors,5,--executor-cores,3,--executor-memory,6G,--name,MyApp,--files,/home/hadoop/gs_test.json, /home/hadoop/app.jar,s3://myBucket/app.conf

Attempt 3:

google.cloud.auth.service.account.json.keyfile = "gs_test.json"

command-runner.jar,spark-submit,--master,yarn,--deploy-mode,client,--jars,/home/hadoop/appHelper.jar,--num-executors,5,--executor-cores,3,--executor-memory,6G,--name,MyApp,--files,/home/hadoop/gs_test.json#gs_test.json, /home/hadoop/app.jar,s3://myBucket/app.conf
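
For reference, my understanding is that the connector settings can also be pushed into the Hadoop configuration seen by the executors through --conf spark.hadoop.* options on spark-submit. A sketch of that variant (property names taken from the GCS connector documentation, and the key file path is assumed to exist at /home/hadoop/gs_test.json on every node; I have not confirmed this under Data Pipeline):

command-runner.jar,spark-submit,--master,yarn,--deploy-mode,client,--jars,/home/hadoop/appHelper.jar,--conf,spark.hadoop.google.cloud.auth.service.account.enable=true,--conf,spark.hadoop.google.cloud.auth.service.account.json.keyfile=/home/hadoop/gs_test.json,--num-executors,5,--executor-cores,3,--executor-memory,6G,--name,MyApp,/home/hadoop/app.jar,s3://myBucket/app.conf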

Here is the error:

java.io.FileNotFoundException: /home/hadoop/gs_test.p12 (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at com.google.api.client.googleapis.auth.oauth2.GoogleCredential$Builder.setServiceAccountPrivateKeyFromP12File(GoogleCredential.java:670)
at com.google.cloud.hadoop.util.CredentialFactory.getCredentialFromPrivateKeyServiceAccount(CredentialFactory.java:234)
at com.google.cloud.hadoop.util.CredentialConfiguration.getCredential(CredentialConfiguration.java:90)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.configure(GoogleHadoopFileSystemBase.java:1816)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.initialize(GoogleHadoopFileSystemBase.java:1003)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.initialize(GoogleHadoopFileSystemBase.java:966)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2717)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:93)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2751)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2733)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:377)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.<init>(FileOutputCommitter.java:113)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.<init>(FileOutputCommitter.java:88)
at org.apache.hadoop.mapreduce.lib.output.DirectFileOutputCommitter.<init>(DirectFileOutputCommitter.java:31)
at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.getOutputCommitter(FileOutputFormat.java:310)
at org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol.setupCommitter(SQLHadoopMapReduceCommitProtocol.scala:36)
at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.setupTask(HadoopMapReduceCommitProtocol.scala:146)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:246)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:191)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:190)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Any idea how to set up Google Cloud Storage correctly for a Spark application using AWS Data Pipeline? Thank you so much for your help.

Upvotes: 3

Views: 1074

Answers (1)

Alan Borsato

Reputation: 256

If I understood correctly: you want to use GCS (gs:// URLs) in your Spark jobs outside Dataproc.

In that case you will have to install the GCS connector, which makes the gs:// URL scheme available: https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/README.md

Installation and setup instructions are in the GitHub link above.
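
As a rough sketch of how the submit could look once the connector jar is on the cluster nodes (the jar name and paths below are placeholders, and the property names are taken from the connector documentation, so adjust them to the version you actually install):

spark-submit --master yarn --deploy-mode client \
  --jars /home/hadoop/gcs-connector-hadoop2-latest.jar,/home/hadoop/appHelper.jar \
  --conf spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem \
  --conf spark.hadoop.fs.AbstractFileSystem.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS \
  --conf spark.hadoop.google.cloud.auth.service.account.enable=true \
  --conf spark.hadoop.google.cloud.auth.service.account.json.keyfile=/home/hadoop/gs_test.json \
  /home/hadoop/app.jar s3://myBucket/app.conf

In a Data Pipeline step this would be the same arguments expressed in the command-runner.jar comma-separated form, with the key file present at that path on every node.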

Upvotes: 0
