Reputation: 99
I have an image file
JavaPairRDD<String, PortableDataStream> image = javaSparkContext.binaryFiles("/path/to/image.jpg");
I would like to process it and then save the binary data to HDFS using Spark.
Something like:
image.saveAsBinaryFile("hdfs://cluster:port/path/to/image.jpg")
Is this possible? I'm not saying it has to be that simple, just possible. If so, how would you do it? I'm trying to keep a one-to-one mapping if possible, preserving the extension and file type, so that if I download the file directly with the HDFS command line it is still a viable image file.
Upvotes: 0
Views: 6448
Reputation: 1295
Yes, it is possible, but you need a data serialization plugin, for example Avro (https://github.com/databricks/spark-avro).
Assume each image is represented as binary (byte[]) in your program, so the images can be a Dataset<byte[]>.
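For example, starting from binaryFiles as in the question, you could build such a Dataset<byte[]> in Java roughly like this (a sketch only; jsc and spark stand for an assumed JavaSparkContext and SparkSession, and the directory path is a placeholder):

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.input.PortableDataStream;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().getOrCreate();
JavaSparkContext jsc = JavaSparkContext.fromSparkContext(spark.sparkContext());

// Read whole files as (path, stream) pairs and keep only the raw bytes.
JavaRDD<byte[]> imageBytes = jsc
    .binaryFiles("/path/to/images")
    .values()
    .map(PortableDataStream::toArray);

// Wrap the bytes in a Dataset<byte[]> using the built-in binary encoder.
Dataset<byte[]> datasetOfImages = spark.createDataset(imageBytes.rdd(), Encoders.BINARY());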
You can save it using
datasetOfImages.write()
.format("com.databricks.spark.avro")
.save("hdfs://cluster:port/path/to/images.avro");
images.avro would be a folder containing multiple partitions, and each partition would be an Avro file storing some of the images.
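If you want to check the result, you can read the folder back with the same package (again a sketch; spark is the SparkSession from the snippet above):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

Dataset<Row> saved = spark.read()
    .format("com.databricks.spark.avro")
    .load("hdfs://cluster:port/path/to/images.avro");
saved.printSchema();  // a single binary column ("value") holding the image bytes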
Edit:
It is also possible, though not recommended, to save the images as separate files. You can call foreach on the dataset and use the HDFS API to write each image.
See below for a piece of code written in Scala; you should be able to translate it into Java.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

datasetOfImages.foreachPartition { images =>
  // Build the Hadoop configuration on the executor: SparkContext is not
  // serializable, so it cannot be referenced inside this closure.
  val fs = FileSystem.get(new Configuration())
  images.foreach { image =>
    // "/path/to/this/image" is a placeholder; each image needs its own distinct path.
    val out = fs.create(new Path("/path/to/this/image"))
    out.write(image)
    out.close()
  }
}
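For reference, a rough Java equivalent of the Scala snippet above could look like this (untested sketch; datasetOfImages is a Dataset<byte[]>, and the output path is again a placeholder that would need to be unique per image):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.spark.api.java.function.ForeachPartitionFunction;

datasetOfImages.foreachPartition((ForeachPartitionFunction<byte[]>) images -> {
    // Build the Hadoop configuration on the executor; SparkContext is not serializable.
    FileSystem fs = FileSystem.get(new Configuration());
    while (images.hasNext()) {
        byte[] image = images.next();
        // Placeholder path: each image needs its own distinct name in practice.
        try (FSDataOutputStream out = fs.create(new Path("/path/to/this/image"))) {
            out.write(image);
        }
    }
});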
Upvotes: 6