Rubber Duck

Reputation: 3723

Using Spark on Dataproc, how to write to GCS separately from each partition?

Using Spark on GCP Dataproc, I successfully write an entire RDD to GCS like so:

rdd.saveAsTextFile(s"gs://$path")

The result is one file per partition, all written under the same path.

How do I write a separate file for each partition, with a unique path based on information from that partition?

Below is an invented, non-working, wishful code example:

    rdd.mapPartitionsWithIndex(
      (i, partition) =>{

        partition.write(path = s"gs://partition_$i", data = partition_specific_data)
      }
    )

When I call the function below from within a partition, it writes to local disk on my Mac; on Dataproc I get an error that does not recognize gs as a valid path.

import org.apache.commons.lang3.exception.ExceptionUtils
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FSDataOutputStream, FileSystem, Path}

def writeLocally(filePath: String, data: Array[Byte], errorMessage: String): Unit = {

  println("Juicy Platform")

  val path = new Path(filePath)

  var ofos: Option[FSDataOutputStream] = None

  try {
    println(s"\nTrying to write to $filePath\n")

    val conf = new Configuration()

    conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
    conf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")

    // conf.addResource(new Path("/home/hadoop/conf/core-site.xml"))

    println(conf.toString)

    val fs = FileSystem.get(conf)

    val fos = fs.create(path)
    ofos = Option(fos)

    fos.write(data)

    println(s"\nWrote to $filePath\n")
  }
  catch {
    case e: Exception =>
      // logError is an application-specific logging helper
      logError(errorMessage, s"Exception occurred writing to GCS:\n${ExceptionUtils.getStackTrace(e)}")
  }
  finally {
    ofos match {
      case Some(i) => i.close()
      case _ =>
    }
  }
}

This is the error:

java.lang.IllegalArgumentException: Wrong FS: gs://path/myFile.json, expected: hdfs://cluster-95cf-m

Upvotes: 2

Views: 3018

Answers (1)

Dennis Huo

Reputation: 10677

If running on a Dataproc cluster, you shouldn't need to explicitly populate "fs.gs.impl" in the Configuration; a new Configuration() should already contain the necessary mappings.

The main problem here is that val fs = FileSystem.get(conf) uses the fs.defaultFS property of the conf; it has no way of knowing that you wanted a FileSystem instance for GCS rather than HDFS. In general, in Hadoop and Spark, a FileSystem instance is fundamentally tied to a single URL scheme; you need to fetch a scheme-specific instance for each different scheme, such as hdfs:// or gs:// or s3://.
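Roughly, the difference looks like this (the bucket name here is just a placeholder):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()

// Resolves against fs.defaultFS -- on a Dataproc cluster that's HDFS,
// so this instance will reject gs:// paths with "Wrong FS".
val defaultFs: FileSystem = FileSystem.get(conf)

// Resolves against the scheme of the Path itself, so this returns a
// GoogleHadoopFileSystem instance that can handle gs:// paths.
val gcsFs: FileSystem = new Path("gs://my-bucket/some/object").getFileSystem(conf)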

The simplest fix is to always use Path.getFileSystem(Configuration) instead of FileSystem.get(Configuration), and to make sure your path is fully qualified with its scheme:

...
val path = new Path("gs://bucket/foo/data")
val fs = path.getFileSystem(conf)

val fos = fs.create(path)
ofos = Option(fos)

fos.write(data)
...
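Putting that back into the per-partition write from your question, a rough sketch could look like the following (the bucket and output path are made up, and records are just written with toString; adapt the serialization to your data):

import java.nio.charset.StandardCharsets
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

rdd.mapPartitionsWithIndex { (i, partition) =>
  // Configuration isn't serializable, so build it inside the closure, on the executor.
  val conf = new Configuration()
  val path = new Path(s"gs://my-bucket/output/partition_$i")
  val fs = path.getFileSystem(conf) // scheme-specific FileSystem for gs://
  val out = fs.create(path)
  try {
    partition.foreach(record => out.write((record.toString + "\n").getBytes(StandardCharsets.UTF_8)))
  } finally {
    out.close()
  }
  Iterator.single(path.toString)
}.count() // mapPartitionsWithIndex is lazy; an action forces the writes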

Upvotes: 3
