Reputation: 2493
I am trying to save a Spark DataFrame to Google Cloud Storage. We are able to save the DataFrame in Parquet format to S3, but because our server runs on Google Compute Engine, transferring the data to S3 would incur a huge cost. Is it possible to do something similar with Google Cloud Storage? Below is what I did in the case of S3:
Add dependencies to build.sbt:
"net.java.dev.jets3t" % "jets3t" % "0.9.4",
"com.amazonaws" % "aws-java-sdk" % "1.10.16"
Use this in the main code:
val sc = new SparkContext(sparkConf)
sc.hadoopConfiguration.set("fs.s3a.awsAccessKeyId", conf.getString("s3.awsAccessKeyId"))
sc.hadoopConfiguration.set("fs.s3a.awsSecretAccessKey", conf.getString("s3.awsSecretAccessKey"))
val df = sqlContext.read.parquet("s3a://.../*") //read file
df.write.mode(SaveMode.Append).parquet(s3FileName) //write file
And finally, use this with spark-submit
spark-submit --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3native.NativeS3FileSystem \
  --conf spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3.S3FileSystem
I tried to look for a similar guide on the internet, but there doesn't seem to be one. Could anyone please suggest how I can get this done?
Thanks.
Upvotes: 5
Views: 3813
Reputation: 2493
In case someone wants to do the same thing, I got this working as follows:
Add library dependency to SBT:
"com.google.cloud.bigdataoss" % "gcs-connector" % "1.4.2-hadoop2"
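In build.sbt this dependency line would sit alongside the others (the surrounding sequence syntax below is just the standard sbt form, sketched for completeness):

```scala
libraryDependencies ++= Seq(
  "com.google.cloud.bigdataoss" % "gcs-connector" % "1.4.2-hadoop2"
)
```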
Set the Hadoop configuration:
sc.hadoopConfiguration.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
sc.hadoopConfiguration.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
sc.hadoopConfiguration.set("fs.gs.project.id", conf.getString("gcs.projectId"))
sc.hadoopConfiguration.set("google.cloud.auth.service.account.enable", "true")
sc.hadoopConfiguration.set("google.cloud.auth.service.account.email", conf.getString("gcs.serviceAccountEmail"))
sc.hadoopConfiguration.set("google.cloud.auth.service.account.keyfile", conf.getString("gcs.serviceAccountKeyFile"))
Then you can save and read files (using gs:// paths) just as you would for S3. The only caveat is that it was not working with Spark 1.4 at the time I tested, so you may want to upgrade to Spark 1.5+ instead.
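For example, reading and writing Parquet against GCS then mirrors the S3 code from the question; the bucket name and paths here are placeholders, not real ones:

```scala
// Assumes sc.hadoopConfiguration has been set up as shown above,
// and that "my-bucket" is replaced with your actual GCS bucket.
val df = sqlContext.read.parquet("gs://my-bucket/input/*") // read from GCS
df.write.mode(SaveMode.Append).parquet("gs://my-bucket/output/") // write to GCS
```

The only change from the S3 version is the `gs://` scheme, which the `fs.gs.impl` setting maps to the GCS connector's filesystem implementation.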
Upvotes: 2