Reputation: 1325
I have a CSV file that I have read into pandas. One of the columns contains a base64-encoded value, but pandas reads it in as a string. How would I go about converting this value (now read in as a string) back to a usable base64 value? The structure looks like this.
Here is an example:
asset_id,asset_name,file_extension,concept_name,image_byte
204863410,7613287394927_H_enUK_1634104697919.jpg,jpg,Nestle Confectionery:Hazelnut,
I have a few more samples loaded up in this file for recreating the error here.
UPDATE:
A commenter pointed out that the string I am trying to convert is already base64, and that is indeed correct. That is the problem!
The string is already base64-encoded, but I am not sure why it is still being rejected. I think the API might be reading the string with the ' at the beginning and end and failing. When I pass the string as-is into the API call, I get this:
image=resources_pb2.Image(
TypeError: '/9j/4AAQSkZJRgABAQIAHAAcAAD/2wBDAAYEBQYFBAYGBQYHBwYIChAKCgkJChQODwwQFxQYGBcUFhYaHSUfGhsjHBYWICwgIyY has type str, but expected one of: bytes
This is the code block I am using
post_inputs_response = stub.PostInputs(
    service_pb2.PostInputsRequest(
        inputs=[
            resources_pb2.Input(
                data=resources_pb2.Data(
                    image=resources_pb2.Image(
                        base64=img_byte_raw
                    )
                )
            )
        ]
    ),
    metadata=metadata
)
which I got from Clarifai's doc linked here
I would appreciate any help.
Upvotes: 3
Views: 1935
Reputation: 177911
In your linked Book1.csv of four rows, the first two entries are valid base64, but the last two are not. The strings in the last rows are very long and may have been truncated at some point. The TypeError you received indicates that the base64-encoded string data needs to be converted back to bytes objects.
Below is an example converting the base64 strings to bytes objects. I used Pillow (pip install Pillow) to display the images and verify that they were decoded correctly:
import pandas as pd
import base64
from PIL import Image
from io import BytesIO

def decode(s):
    try:
        return base64.b64decode(s)
    except ValueError as e:
        # Keep the error in the cell so the bad rows are visible
        return e

df = pd.read_csv(r'downloads\book1.csv',encoding='utf-8-sig')
df['image_byte'] = df['image_byte'].apply(decode)
print(df)
Image.open(BytesIO(df.image_byte[0])).show()
Image.open(BytesIO(df.image_byte[1])).show()
Output:
asset_id ... image_byte
0 204863410 ... b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x02...
1 204863409 ... b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x02...
2 204863134 ... Incorrect padding
3 204863133 ... Incorrect padding
[4 rows x 5 columns]
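The "Incorrect padding" rows can also be flagged before any decoding is attempted. The following is a minimal sketch, assuming you only want to separate well-formed base64 from strings that were cut off; the helper name is_valid_base64 and the sample bytes are hypothetical and not taken from the CSV:

```python
import base64
import binascii


def is_valid_base64(s):
    """Return True if s decodes as standard base64, False otherwise."""
    try:
        # validate=True rejects characters outside the base64 alphabet
        # instead of silently discarding them.
        base64.b64decode(s, validate=True)
        return True
    except (binascii.Error, ValueError):
        return False


good = base64.b64encode(b"\xff\xd8\xff\xe0 sample bytes").decode("ascii")
truncated = good[:-3]  # simulate a string that was cut off

print(is_valid_base64(good))
print(is_valid_base64(truncated))
```

Note that this check is not exhaustive: a string truncated at a length that happens to be a multiple of four can still decode without error, so a passing check does not guarantee a complete image.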
The first image:
Upvotes: 5
Reputation: 2359
OK, I think I've found your issue. Let's use a fresh file as an example and work with that. I'm working with a small snippet of a gray square.
import cv2
from base64 import b64encode, b64decode
import numpy as np

# Load image, b64 encode, byte decode to string
img = cv2.imread(r'C:\Users\me\Pictures\koala.png', cv2.IMREAD_GRAYSCALE)
encoded_img = b64encode(img)
image_b64_str = encoded_img.decode("utf-8")

# Read the string back, encode into bytes, then b64 decode
image_b64_in = image_b64_str.encode("utf-8")
base64_decoded_image = b64decode(image_b64_in)

# Interpret both byte strings as uint8 arrays for comparison
decoded_img_from_string_only = np.frombuffer(image_b64_in, dtype=np.uint8)
decoded_img_from_b64_decode = np.frombuffer(base64_decoded_image, dtype=np.uint8)

print(img)
print(f"b64 encoded: {encoded_img}")
print(f"b64 encoded then string decoded: {image_b64_str}")
print('')
print(f"String encoded to bytes: {image_b64_in}")
print(f"Bytes decoded to array: {decoded_img_from_string_only}")
print('')
print(f"String encoded to bytes and then b64 decoded: {base64_decoded_image}")
print(f"B64-decoded bytes decoded to array: {decoded_img_from_b64_decode}")
Here's the output from the above block:
[[181 182 182 182 181 182 186]
[181 182 182 182 182 183 186]
[182 182 183 183 184 185 186]
[182 182 183 184 185 186 186]
[182 182 182 183 185 186 187]
[180 181 181 182 184 185 187]
[178 179 181 181 182 184 188]
[177 179 180 180 181 183 188]]
b64 encoded: b'tba2trW2urW2tra2t7q2tre3uLm6tra3uLm6ura2tre5uru0tbW2uLm7srO1tba4vLGztLS1t7w='
b64 encoded then string decoded: tba2trW2urW2tra2t7q2tre3uLm6tra3uLm6ura2tre5uru0tbW2uLm7srO1tba4vLGztLS1t7w=
String encoded to bytes: b'tba2trW2urW2tra2t7q2tre3uLm6tra3uLm6ura2tre5uru0tbW2uLm7srO1tba4vLGztLS1t7w='
Bytes decoded to array: [116 98 97 50 116 114 87 50 117 114 87 50 116 114 97 50 116 55
113 50 116 114 101 51 117 76 109 54 116 114 97 51 117 76 109 54
117 114 97 50 116 114 101 53 117 114 117 48 116 98 87 50 117 76
109 55 115 114 79 49 116 98 97 52 118 76 71 122 116 76 83 49
116 55 119 61]
String encoded to bytes and then b64 decoded: b'\xb5\xb6\xb6\xb6\xb5\xb6\xba\xb5\xb6\xb6\xb6\xb6\xb7\xba\xb6\xb6\xb7\xb7\xb8\xb9\xba\xb6\xb6\xb7\xb8\xb9\xba\xba\xb6\xb6\xb6\xb7\xb9\xba\xbb\xb4\xb5\xb5\xb6\xb8\xb9\xbb\xb2\xb3\xb5\xb5\xb6\xb8\xbc\xb1\xb3\xb4\xb4\xb5\xb7\xbc'
B64-decoded bytes decoded to array: [181 182 182 182 181 182 186 181 182 182 182 182 183 186 182 182 183 183
184 185 186 182 182 183 184 185 186 186 182 182 182 183 185 186 187 180
181 181 182 184 185 187 178 179 181 181 182 184 188 177 179 180 180 181
183 188]
What's interesting to note is that the string re-encoded to bytes looks identical to the original b64 encoding but yields a completely different array from the original, whereas the b64-decoded version of those bytes looks very different visually but re-creates the original array. The re-created array is flat, without the original shape, which is typical for a buffer; you'll need to supply the shape yourself.
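Restoring the shape can be sketched as follows. This is only an illustration under stated assumptions: the 8x7 array is an arbitrary stand-in for a grayscale image rather than the image used above, and it assumes you know the original shape and dtype, since neither is stored in the base64 text:

```python
import base64
import numpy as np

# A small stand-in for a grayscale image (the values are arbitrary).
img = np.arange(56, dtype=np.uint8).reshape(8, 7)

# Encode the raw pixel buffer and keep it as text, as it would appear in a CSV.
b64_text = base64.b64encode(img.tobytes()).decode("ascii")

# Decode back to raw bytes, then restore dtype and shape explicitly.
raw = base64.b64decode(b64_text)
restored = np.frombuffer(raw, dtype=np.uint8).reshape(img.shape)

print(np.array_equal(img, restored))
```

This applies to raw pixel buffers like the one in this answer. Base64 that wraps an encoded file such as a JPEG carries its own dimensions and would be opened with an image library instead.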
I think you need to use df['col'].apply(lambda x: b64decode(x.encode("utf-8"))) on your saved strings if you want to read them back for the purpose of converting them to images.
However, this raises the question: why doesn't your_string.encode("utf-8") work as a byte representation for the API, when b64encode(b64decode(your_string.encode("utf-8"))) should theoretically yield the same representation? I'm not certain about that. Maybe check whether df['col'].apply(lambda x: x.encode("utf-8")) or df['col'].astype(bytes), as another commenter suggested, gives you what you want?
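The relationship between these byte representations can be checked directly. Below is a small sketch using a made-up sample, the first bytes of a JPEG header, which encodes to the same /9j/4AAQSkZJRg prefix seen in the question. It does not establish which form the API expects, only how the candidates differ:

```python
import base64

raw = b"\xff\xd8\xff\xe0\x00\x10JFIF"             # first bytes of a JPEG header
b64_text = base64.b64encode(raw).decode("ascii")  # what ends up in the CSV as str

as_text_bytes = b64_text.encode("utf-8")          # still base64 text, now as bytes
round_tripped = base64.b64encode(base64.b64decode(as_text_bytes))
decoded = base64.b64decode(b64_text)              # the original raw bytes

print(as_text_bytes == round_tripped)
print(decoded == raw)
print(as_text_bytes == decoded)
```

So .encode("utf-8") and the b64encode(b64decode(...)) round trip give the same bytes, and both are still base64 text. Only b64decode returns the underlying image bytes, so if the API rejects one form, the other is the one to try.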
Further reading: Convert string of base64 back to base64 bytes; How do you decode Base64 data in Python?
Upvotes: 1