dsimmie

Reputation: 189

numpy trimming trailing zeros in byte strings

I am trying to serialise spacy.io documents into byte strings and save them in a numpy array.

spacy has a to_bytes function which produces a bytearray. I call str on this bytearray and insert that string object into a numpy array. This works for most documents except those that end with a trailing zero byte.

To reproduce:

>>> import numpy as np
>>> b_arr = bytearray(b'\xca\x00\x00\x00n\xff\xff\xff\x19C\x98\xc9\x06\xb18{\xa5\xe0\xaf6\xe3\x9f\xa7\xad\x86\xd6\x8d\xc0\xe6Mo;{\x96xm\x80\xe5\x8c\x9f<!\xc33\x9dg\xd3\xb3D\xf6\xac\x03P\x8do\x07m$r)\x06XBI\xc87\xcao\x83\x1d\xe4\r]\x86\xda\xeb\xb8\x1f\xd5\xcb\xde\xaa\x85r\x0f\xf1=p\xd6\x01\xdc\x83Z|&\xeb\xce|\xf9o\xa0\xe99x\x87\x87\xac\x1b\x17\x08\x000\x92\x10A\x98\x10\x13\x89( 0\x88 "!*N\xf8\xe6\xf4\r\xb1e\xf0\x9d\xfd\x80\xa2G2\x18\xdesv\xec\x85\xf7\xb1\xb3\xb3\xa68\xa7n\xe8BF\xa6\xe0\xb1\x8d\x8d\x9c\xe5\x99\x9bV\xfcE`\x1cI\x92$I\x92$I\x92$%I\x92\xe4\xff\xff\x7f\xd1\xff\xf0T\xa6\xe8\n\x9a\xd3\xffMe0\xa9\x15\xf1|\x00')
>>> b_arr_text = str(b_arr)
>>> b_arr_np = np.asarray([b_arr_text], dtype=np.str)
>>> b_arr_text == b_arr_np[0]
Out[229]: False
>>> len(b_arr_text)
Out[230]: 206
>>> len(b_arr_np[0])
Out[231]: 205
>>> b_arr_np.dtype
Out[232]: dtype('S206')

The numpy string drops the trailing zero byte, even though the dtype of the fixed-length string has the same length as the input text.

You can see the same thing by creating an array from any byte string that ends in zero bytes:

>>> np.asarray(['\xca\x00\x00\x00'], dtype=np.str)

Out: array(['\xca'], dtype='|S4')

I presume numpy deems trailing zeros to be insignificant? However I can't deserialize these bytestrings back to a spacy document object.

Is there any way to get numpy not to trim the trailing zeros or do I have to stick to Python lists for this scenario?

Upvotes: 1

Views: 2151

Answers (2)

pullmyteeth

Reputation: 512

You want the np.void dtype.

String or bytes arrays will always chop off the trailing zeros.

a = np.array([b"\x00\x00"], dtype=str)  # np.str was an alias for the builtin str, removed in NumPy 1.24
a
# Out: array([''], dtype='<U2')
a[0]
# Out: ''    

But a void array won't.

a = np.array([b"\x00\x00"], dtype=np.void)
a
# Out: array([b'\x00\x00'], dtype='|V2')
a[0]
# Out: void(b'\x00\x00')

One slight complication is that each array element is now wrapped in a void(...) scalar. You can unwrap a single element with:

a[0].item()
# Out: b'\x00\x00'

or for the whole array:

a = a.astype(object)
a
# Out: array([b'\x00\x00'], dtype=object)
a[0]
# Out: b'\x00\x00'

If you replace the line

b_arr_np = np.asarray([b_arr_text], dtype=np.str)

with

b_arr_np = np.asarray([b_arr_text], dtype=np.void).astype(object)

then your example behaves as you expected.
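Since the conversion above ends with an object array anyway, here is a minimal Python 3 sketch that builds the object array directly. The `payloads` list is a hypothetical stand-in for the output of spacy's to_bytes, and the sketch assumes you do not need a fixed-width dtype.

```python
import numpy as np

# Hypothetical stand-ins for serialised documents (not real spacy output);
# the second one ends in zero bytes.
payloads = [b"\xca\x01\x02", b"\xca\x00\x00\x00"]

# dtype=object stores references to the original bytes objects, so nothing
# is padded or stripped and the elements may have different lengths.
docs = np.array(payloads, dtype=object)

# Each element comes back as the exact bytes object that went in.
restored = [bytes(item) for item in docs]
```

Keep in mind that np.save pickles object arrays, so np.load needs allow_pickle=True to read them back.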

Upvotes: 0

Vadim Shkaberda

Reputation: 2936

This is normal behavior. If you call b_arr_np.tostring(), you can see that all the trailing zeros are still there.

b_arr = bytearray(b'\xca\x00\x00\x00')

b_arr_text = str(b_arr)

b_arr_np = np.asarray([b_arr_text], dtype=np.str)

b_arr_np
Out[303]: 
array(['\xca'], 
      dtype='|S4')

b_arr_np.tostring()
Out[304]: '\xca\x00\x00\x00'
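The session above is Python 2. As a rough Python 3 equivalent, assuming a bytes payload and an explicit 'S4' dtype (tobytes is the current name for tostring), the same distinction between element access and the raw buffer looks like this:

```python
import numpy as np

payload = b"\xca\x00\x00\x00"

# Fixed-width bytes dtype: four bytes are stored per element.
arr = np.array([payload], dtype="S4")

# Reading an element strips the trailing NUL bytes...
element = arr[0]

# ...but the underlying buffer still holds all four bytes.
raw = arr.tobytes()
```

Only the per-element view loses the zeros. The stored data does not, although for an array with several elements you would need to slice the raw buffer by the dtype's itemsize yourself.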

See the post "information loss with bytes type" on GitHub. The workarounds are either to end the data with a non-zero byte or to use dtype=uint8 with b_arr:

b_arr_np = np.asarray([b_arr], dtype=np.uint8)

b_arr_np
Out[319]: array([[202,   0,   0,   0]], dtype=uint8)

b_arr_np.tostring()

Out[320]: '\xca\x00\x00\x00'
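A minimal Python 3 sketch of the uint8 route, assuming a single document whose bytes are already in a bytearray. np.frombuffer is used here instead of np.asarray, which is a substitution on my part and gives a flat 1-D array rather than the 2-D one shown above:

```python
import numpy as np

payload = bytearray(b"\xca\x00\x00\x00")

# View the raw bytes as unsigned 8-bit integers; every byte is an ordinary
# array element, so zero bytes are never treated as padding.
as_uint8 = np.frombuffer(payload, dtype=np.uint8)

# Convert back to a bytes object for deserialisation.
restored = as_uint8.tobytes()
```

Documents of different lengths would each need their own array (or a stored length) with this approach, since a single uint8 array has no notion of element boundaries.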

Upvotes: 4
