Atanu chatterjee

Reputation: 487

Reading and Saving Image File in Pyspark

I have a requirement to read an image from an S3 bucket and convert it into a base64-encoded string.

I am able to read the image file from S3, but when I pass the S3 file path to my base64 helper, it cannot recognize the path.

So I thought I would save the image dataframe (i.e. the image itself) to a temporary path on the cluster and then pass that path to the base64 helper.

But while saving the image dataframe I get the error below. (Initially I tried to save it with the "image" format, but I found on Google that there is a bug with that format, and someone suggested using the format below.)

java.lang.ClassNotFoundException: Failed to find data source: org.apache.spark.ml.source.image.PatchedImageFileFormat.

Please see my sample code below and tell me where I can find the dependent package.

spark._jsc.hadoopConfiguration().set('fs.s3a.access.key', '************')
spark._jsc.hadoopConfiguration().set('fs.s3a.secret.key', '************')
spark._jsc.hadoopConfiguration().set('fs.s3a.endpoint', '************')

import base64

def getImageStr(img):
    with open(img, "rb") as imageFile:
        str1 = base64.b64encode(imageFile.read())
        str2 = str(str1, 'utf-8')
    return str2

img_df = spark.read\
  .format("image")\
  .load("s3a://xxx/yyy/zzz/hello.jpg")

img_df.printSchema()


img_df.write\
    .format("org.apache.spark.ml.source.image.PatchedImageFileFormat")\
    .save("/tmp/sample.jpg")

img_str = getImageStr("/tmp/sample.jpg")

print(img_str)
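As an aside, the reason passing the S3 path straight into the helper fails is that Python's open() only understands local filesystem paths, not s3a:// URIs. A minimal standalone sketch of that behavior (the bytes below are hypothetical stand-ins for the image; no Spark required):

```python
import base64
import os
import tempfile

def getImageStr(img):
    # Same helper as above: works only for local filesystem paths
    with open(img, "rb") as imageFile:
        return str(base64.b64encode(imageFile.read()), "utf-8")

# An s3a:// URI is not a local path, so open() raises an OSError --
# this is why the helper cannot take the S3 path directly:
try:
    getImageStr("s3a://bucket/key/hello.jpg")
except OSError as exc:
    print("cannot open S3 URI directly:", type(exc).__name__)

# A local temp file (hypothetical bytes standing in for the image) works:
with tempfile.NamedTemporaryFile(suffix=".jpg", delete=False) as f:
    f.write(b"\xff\xd8\xff\xe0")  # JPEG magic-number stub
    path = f.name
print(getImageStr(path))  # prints the base64 text of the file's bytes
os.remove(path)
```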

Please advise if there is any other way to download an image file from S3 in Spark (without using the boto3 package).

Upvotes: 2

Views: 5537

Answers (1)

Alex Ott

Reputation: 87359

When you use the image data source, you get a dataframe with an image column that carries a binary payload: image.data contains the actual image bytes. You can then use the built-in base64 function to encode that column and write the encoded representation to a file. Something like this (not tested):

from pyspark.sql.functions import base64, col
img_df = spark.read.format("image").load("s3a://xxx/yyy/zzz/hello.jpg")
proc_df = img_df.select(base64(col("image.data")).alias('encoded'))
proc_df.coalesce(1).write.format("text").save('/tmp/sample.jpg')
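To sanity-check the round trip: the base64 column function produces standard base64 text, so the contents of the saved text file should decode straight back to the original bytes with plain Python. A standalone sketch (the sample bytes are a hypothetical stand-in for image.data):

```python
import base64

# Hypothetical stand-in for the bytes held in the image.data column
raw = b"\x89PNG\r\n\x1a\n" + bytes(range(16))

# Encoding as the base64() column function would (standard base64 text)...
encoded = base64.b64encode(raw).decode("ascii")

# ...means the saved text decodes back to the original bytes
decoded = base64.b64decode(encoded)
assert decoded == raw
```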

Upvotes: 2
