Reputation: 2493
I am trying to save a Spark DataFrame to Google Cloud Storage. We are able to save the DataFrame in Parquet format to S3, but because our server runs on Google Compute Engine, transferring the data to S3 would incur a huge cost. Is it possible to do something similar with Google Cloud Storage? Below is what I did in the case of S3:
Add dependencies to build.sbt:
"net.java.dev.jets3t" % "jets3t" % "0.9.4",
"com.amazonaws" % "aws-java-sdk" % "1.10.16"
Use this in the main code:
val sc = new SparkContext(sparkConf)
sc.hadoopConfiguration.set("fs.s3a.awsAccessKeyId", conf.getString("s3.awsAccessKeyId"))
sc.hadoopConfiguration.set("fs.s3a.awsSecretAccessKey", conf.getString("s3.awsSecretAccessKey"))
val df = sqlContext.read.parquet("s3a://.../*") //read file
df.write.mode(SaveMode.Append).parquet(s3FileName) //write file
And finally, use this with spark-submit
spark-submit --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3native.NativeS3FileSystem \
  --conf spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3.S3FileSystem
I tried to look for a similar guide on the internet, but there doesn't seem to be one. Could anyone please suggest how I can get this done?
Thanks.
Upvotes: 5
Views: 3813
Reputation: 2493
In case someone wants to do the same thing, I got this working as follows:
Add library dependency to SBT:
"com.google.cloud.bigdataoss" % "gcs-connector" % "1.4.2-hadoop2"
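In build.sbt this dependency line would sit alongside the others (the surrounding sequence syntax below is just the standard sbt form, sketched for completeness):

```scala
libraryDependencies ++= Seq(
  "com.google.cloud.bigdataoss" % "gcs-connector" % "1.4.2-hadoop2"
)
```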
Set the Hadoop configuration:
sc.hadoopConfiguration.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
sc.hadoopConfiguration.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
sc.hadoopConfiguration.set("fs.gs.project.id", conf.getString("gcs.projectId"))
sc.hadoopConfiguration.set("google.cloud.auth.service.account.enable", "true")
sc.hadoopConfiguration.set("google.cloud.auth.service.account.email", conf.getString("gcs.serviceAccountEmail"))
sc.hadoopConfiguration.set("google.cloud.auth.service.account.keyfile", conf.getString("gcs.serviceAccountKeyFile"))
Then you can save and read files (using gs:// paths) just as you would for S3. The only caveat is that it was not working with Spark 1.4 at the time I tested, so you may want to upgrade to Spark 1.5+ instead.
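For example, reading and writing Parquet against GCS then mirrors the S3 code from the question; the bucket name and paths here are placeholders, not real ones:

```scala
// Assumes sc.hadoopConfiguration has been set up as shown above,
// and that "my-bucket" is replaced with your actual GCS bucket.
val df = sqlContext.read.parquet("gs://my-bucket/input/*") // read from GCS
df.write.mode(SaveMode.Append).parquet("gs://my-bucket/output/") // write to GCS
```

The only change from the S3 version is the `gs://` scheme, which the `fs.gs.impl` setting maps to the GCS connector's filesystem implementation.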
Upvotes: 2