Reputation: 487
I need to read an image from an S3 bucket and convert it to base64.
I can read the image file from S3, but when I pass the S3 file path to my base64 method, it does not recognize the path.
So I thought I would save the image dataframe (that is, the image) to a temp path on the cluster and then pass that path to the base64 method.
But while saving the image dataframe I get the error below. (Initially I tried to save the image dataframe with the "image" format, but I found via Google that there is a bug with this format, and someone suggested using the format below.)
java.lang.ClassNotFoundException: Failed to find data source: org.apache.spark.ml.source.image.PatchedImageFileFormat.
Please see my sample code below, and please tell me where I can find the dependent package.
spark._jsc.hadoopConfiguration().set('fs.s3a.access.key', '************')
spark._jsc.hadoopConfiguration().set('fs.s3a.secret.key', '************')
spark._jsc.hadoopConfiguration().set('fs.s3a.endpoint', '************')
import base64

def getImageStr(img):
    # Read a local file and return its contents as a base64 string
    with open(img, "rb") as imageFile:
        str1 = base64.b64encode(imageFile.read())
        str2 = str(str1, 'utf-8')
    return str2
img_df = spark.read\
    .format("image")\
    .load("s3a://xxx/yyy/zzz/hello.jpg")
img_df.printSchema()
img_df.write\
    .format("org.apache.spark.ml.source.image.PatchedImageFileFormat")\
    .save("/tmp/sample.jpg")
img_str = getImageStr("/tmp/sample.jpg")
print(img_str)
Please advise whether there is any other way to download an image file from S3 in Spark (without using the boto3 package).
Upvotes: 2
Views: 5537
Reputation: 87359
When you use the image data source, you get a dataframe with an image column, and there is a binary payload inside it: image.data holds the image bytes. Then you can use the built-in function base64 to encode that column, and you can write the encoded representation to a file. Something like this (not tested):
from pyspark.sql.functions import base64, col
img_df = spark.read.format("image").load("s3a://xxx/yyy/zzz/hello.jpg")
proc_df = img_df.select(base64(col("image.data")).alias('encoded'))
# Note: the text writer creates a directory of part files at this path
proc_df.coalesce(1).write.format("text").save('/tmp/sample.jpg')
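One caveat, as far as I understand the image data source: it decodes the file, so image.data holds pixel bytes rather than the original JPEG bytes. If you want the base64 of the file exactly as it is stored in S3, a rough sketch (also not tested against a real bucket) is to read it with the binaryFile data source, which I believe needs Spark 3.0 or later, and encode the content column on the driver. The helper names encode_bytes and read_image_as_base64 are made up for this example, and it assumes the image is small enough to collect to the driver.

```python
import base64


def encode_bytes(raw):
    """Return the base64 text for raw file bytes."""
    return base64.b64encode(raw).decode("utf-8")


def read_image_as_base64(spark, path):
    """Read one file's raw bytes via the binaryFile source and encode them.

    Assumes `spark` is an existing SparkSession with the s3a settings
    already applied, and that `path` points to a single small file.
    """
    row = (
        spark.read.format("binaryFile")  # columns: path, modificationTime, length, content
        .load(path)
        .select("content")
        .first()
    )
    # `content` comes back as a bytearray holding the untouched file bytes
    return encode_bytes(bytes(row["content"]))
```

You would call it as read_image_as_base64(spark, "s3a://xxx/yyy/zzz/hello.jpg"), which avoids writing anything to /tmp and does not need boto3.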
Upvotes: 2