How to read binary data in pyspark

Question

I'm reading the binary file http://snap.stanford.edu/data/amazon/productGraph/image_features/image_features.b using PySpark.

import array
from io import StringIO

img_embedding_file = sc.binaryRecords("s3://bucket/image_features.b", 4106)

def mapper(features):
    a = array.array('f')
    a.frombytes(features)
    return a.tolist()

def byte_mapper(bytes):
    return str(bytes)

decoded_embeddings = img_embedding_file.map(lambda x: [byte_mapper(x[:10]), mapper(x[10:])])

When just product_id is selected from the rdd using

decoded_embeddings = img_embedding_file.map(lambda x: [byte_mapper(x[:10]), mapper(x[10:])])

The output for product_id is

["b'1582480311'", "b'\x00\x00\x00\x00\x88c-?\xeb\xe2'", "b'7@\x00\x00\x00\x00\x00\x00\x00\x00'", "b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'", "b'\xec/\x0b?\x00\x00\x00\x00K\xea'", "b'\x00\x00c\x7f\xd9?\x00\x00\x00\x00'", "b'L\xa6\n>\x00\x00\x00\x00\xfe\xd4'", "b'\x00\x00\x00\x00\x00\x00\xe5\xd0\xa2='", "b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'", "b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'"]

The file is hosted on S3. In each record, the first 10 bytes are the product_id and the next 4096 bytes are the image_features. I'm able to extract all 4096 image features, but I'm having trouble reading the first 10 bytes and converting them into a readable format.

blackbishop · Accepted Answer

EDIT:

It turns out the problem comes from the recordLength. Each feature is a 4-byte float, so the record length is not 4096 + 10 but 4096*4 + 10 = 16394. Changing the call to:

img_embedding_file = sc.binaryRecords("s3://bucket/image_features.b", 16394)

should work. You can actually see this in the sample code provided on the website you downloaded the binary file from:

for i in range(4096):
    feature.append(struct.unpack('f', f.read(4)))  # <-- 4 bytes per float, so 4096 * 4
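
To double-check the record length and the layout locally, here is a minimal sketch using the struct module. The names RECORD, RECORD_LENGTH and parse_record are my own, not part of the dataset's code, and the little-endian "<" prefix is an assumption about how the file was written (it also disables padding, which matters for the size). The sample record is synthetic, not real data.

```python
import struct

ID_BYTES = 10      # product_id length, as stated in the question
N_FEATURES = 4096  # number of float features per record

# "<" = little-endian, no alignment padding (assumption about the file layout)
RECORD = struct.Struct(f"<{ID_BYTES}s{N_FEATURES}f")
RECORD_LENGTH = RECORD.size  # 10 + 4096 * 4 = 16394


def parse_record(record):
    """Split one fixed-length record into (product_id, list of floats)."""
    raw_id, *features = RECORD.unpack(record)
    return raw_id.decode("utf-8"), features


# Synthetic record for a quick local check
sample = RECORD.pack(b"1582480311", *([0.5] * N_FEATURES))
product_id, features = parse_record(sample)
print(RECORD_LENGTH, product_id, len(features))
```

If this matches your data, the same function could be applied in Spark with sc.binaryRecords("s3://bucket/image_features.b", RECORD_LENGTH).map(parse_record).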

Old answer:

I think the issue comes from your byte_mapper function. That's not the correct way to convert bytes to string. You should be using decode:

bytes = b'1582480311'
print(str(bytes))
# output: "b'1582480311'"

print(bytes.decode("utf-8"))
# output: '1582480311'

If you're getting the error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x88 in position 4: invalid start byte

That means the product_id bytes are not valid UTF-8. If you don't know the input encoding, it's difficult to convert them into strings.

However, you may want to ignore those bytes by passing "ignore" as the errors argument of decode:

bytes.decode("utf-8", "ignore")
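
As a small illustration of what the error handlers do, here is a sketch that decodes one of the 10-byte slices from the question's output. The variable names are mine. Note that "ignore" silently drops bytes, so it can hide a real problem such as misaligned records.

```python
# One of the 10-byte slices shown in the question's output
raw = b"\x00\x00\x00\x00\x88c-?\xeb\xe2"

try:
    raw.decode("utf-8")  # strict error handling is the default
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False    # 0x88 is not a valid UTF-8 start byte

ignored = raw.decode("utf-8", "ignore")    # undecodable bytes are dropped
replaced = raw.decode("utf-8", "replace")  # undecodable bytes become U+FFFD

print(strict_ok, repr(ignored), repr(replaced))
```

Neither result is a usable product_id here, which is a hint that the slice does not contain text at all.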
