tourist

Reputation: 4333

How to read binary data in pyspark

I'm reading the binary file http://snap.stanford.edu/data/amazon/productGraph/image_features/image_features.b using PySpark.

import array
from io import StringIO

img_embedding_file = sc.binaryRecords("s3://bucket/image_features.b", 4106)

def mapper(features):
    a = array.array('f')
    a.frombytes(features)
    return a.tolist()

def byte_mapper(bytes):
    return str(bytes)

decoded_embeddings = img_embedding_file.map(lambda x: [byte_mapper(x[:10]), mapper(x[10:])])

When just product_id is selected from the rdd using

decoded_embeddings = img_embedding_file.map(lambda x: [byte_mapper(x[:10]), mapper(x[10:])])

The output for product_id is

["b'1582480311'", "b'\\x00\\x00\\x00\\x00\\x88c-?\\xeb\\xe2'", "b'7@\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00'", "b'\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00'", "b'\\xec/\\x0b?\\x00\\x00\\x00\\x00K\\xea'", "b'\\x00\\x00c\\x7f\\xd9?\\x00\\x00\\x00\\x00'", "b'L\\xa6\\n>\\x00\\x00\\x00\\x00\\xfe\\xd4'", "b'\\x00\\x00\\x00\\x00\\x00\\x00\\xe5\\xd0\\xa2='", "b'\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00'", "b'\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00'"]

The file is hosted on S3. In each row, the first 10 bytes are the product_id and the next 4096 bytes are the image_features. I'm able to extract all 4096 image features, but I'm having trouble reading the first 10 bytes and converting them into a readable format.

Upvotes: 1

Views: 3979

Answers (1)

blackbishop

Reputation: 32680

EDIT:

It turns out the problem comes from the recordLength. Each of the 4096 features is a 4-byte float, so the record length is not 4096 + 10 but 4096*4 + 10 = 16394. Changing it to:

img_embedding_file = sc.binaryRecords("s3://bucket/image_features.b", 16394)

Should work. You can see this in the sample code provided on the website you downloaded the binary file from:

for i in range(4096):
     feature.append(struct.unpack('f', f.read(4))) # <-- so 4096 * 4
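
As a rough sketch of the corrected layout, assuming each record is a 10-byte ASCII product ID followed by 4096 little-endian 4-byte floats (the byte order, the constants, and the helper name parse_record are assumptions for illustration, not something the dataset page states), one record could be split like this:

```python
import struct

# Assumed layout: 10-byte ASCII id, then 4096 float32 values.
ID_BYTES = 10
NUM_FEATURES = 4096
FLOAT_BYTES = 4
RECORD_LENGTH = ID_BYTES + NUM_FEATURES * FLOAT_BYTES  # 16394


def parse_record(record):
    """Split one fixed-length record into (product_id, features)."""
    if len(record) != RECORD_LENGTH:
        raise ValueError(
            "expected %d bytes, got %d" % (RECORD_LENGTH, len(record))
        )
    # The id bytes are decoded, not passed through str().
    product_id = record[:ID_BYTES].decode("ascii")
    # "<" assumes little-endian floats; adjust if the file differs.
    features = list(struct.unpack("<%df" % NUM_FEATURES, record[ID_BYTES:]))
    return product_id, features


# Synthetic record for a local check (not real data from the file).
sample = b"1582480311" + struct.pack("<%df" % NUM_FEATURES, *([0.5] * NUM_FEATURES))
product_id, features = parse_record(sample)
print(product_id, len(features))
```

With Spark, the same function could presumably be applied as sc.binaryRecords("s3://bucket/image_features.b", RECORD_LENGTH).map(parse_record).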

Old answer:

I think the issue comes from your byte_mapper function. That's not the correct way to convert bytes to string. You should be using decode:

bytes = b'1582480311'
print(str(bytes))
# output: "b'1582480311'"

print(bytes.decode("utf-8"))
# output: '1582480311'

If you're getting the error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x88 in position 4: invalid start byte

That means the product_id bytes are not valid UTF-8. If you don't know the input encoding, it's difficult to convert them into strings.

However, you can drop the undecodable bytes by passing "ignore" as the errors argument to decode:

bytes.decode("utf-8", "ignore") 
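
To illustrate what "ignore" does, here is a small sketch using the second value from the question's output (the variable names are made up for this example). Strict decoding fails on it, while "ignore" silently drops the invalid bytes, which can hide the fact that the bytes were never an ID in the first place:

```python
# Second "product_id" from the question's output: actually float data,
# because the records were split at the wrong length.
raw = b"\x00\x00\x00\x00\x88c-?\xeb\xe2"

try:
    raw.decode("utf-8")  # strict decoding is the default
    strict_failed = False
except UnicodeDecodeError as exc:
    strict_failed = True
    print("strict decode failed:", exc)

# "ignore" skips every byte that is not valid UTF-8.
cleaned = raw.decode("utf-8", "ignore")
print(repr(cleaned), len(raw), len(cleaned))
```

The cleaned string is shorter than the input and still not a usable ID, so a decode error here is better treated as a sign that the record boundaries are off.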

Upvotes: 2
