Reputation: 189
I am trying to serialise spaCy documents into byte strings and save them in a numpy array. spacy has a to_bytes function which produces a bytearray. I call str on this bytearray and insert that string object into a numpy array. This works for most documents except those that end with a trailing zero byte.
To reproduce:
>>> import numpy as np
>>> b_arr = bytearray(b'\xca\x00\x00\x00n\xff\xff\xff\x19C\x98\xc9\x06\xb18{\xa5\xe0\xaf6\xe3\x9f\xa7\xad\x86\xd6\x8d\xc0\xe6Mo;{\x96xm\x80\xe5\x8c\x9f<!\xc33\x9dg\xd3\xb3D\xf6\xac\x03P\x8do\x07m$r)\x06XBI\xc87\xcao\x83\x1d\xe4\r]\x86\xda\xeb\xb8\x1f\xd5\xcb\xde\xaa\x85r\x0f\xf1=p\xd6\x01\xdc\x83Z|&\xeb\xce|\xf9o\xa0\xe99x\x87\x87\xac\x1b\x17\x08\x000\x92\x10A\x98\x10\x13\x89( 0\x88 "!*N\xf8\xe6\xf4\r\xb1e\xf0\x9d\xfd\x80\xa2G2\x18\xdesv\xec\x85\xf7\xb1\xb3\xb3\xa68\xa7n\xe8BF\xa6\xe0\xb1\x8d\x8d\x9c\xe5\x99\x9bV\xfcE`\x1cI\x92$I\x92$I\x92$%I\x92\xe4\xff\xff\x7f\xd1\xff\xf0T\xa6\xe8\n\x9a\xd3\xffMe0\xa9\x15\xf1|\x00')
>>> b_arr_text = str(b_arr)
>>> b_arr_np = np.asarray([b_arr_text], dtype=np.str)
>>> b_arr_text == b_arr_np[0]
Out[229]: False
>>> len(b_arr_text)
Out[230]: 206
>>> len(b_arr_np[0])
Out[231]: 205
>>> b_arr_np.dtype
Out[232]: dtype('S206')
The numpy string trims any trailing zeros, even though the dtype of the fixed-length string has the same length as the input text.
You can see this even from creating any bytestring with trailing zero bytes in an array:
>>> np.asarray(['\xca\x00\x00\x00'], dtype=np.str)
Out: array(['\xca'], dtype='|S4')
I presume numpy deems trailing zeros to be insignificant? However, without them I can't deserialize these bytestrings back into a spacy document object.
Is there any way to get numpy not to trim the trailing zeros, or do I have to stick to Python lists for this scenario?
Upvotes: 1
Views: 2151
Reputation: 512
You want the np.void dtype.
String or bytes arrays will always chop off the trailing zeros.
# np.str was an alias for the builtin str and was removed in NumPy 1.24
a = np.array([b"\x00\x00"], dtype=str)
a
# Out: array([''], dtype='<U2')
a[0]
# Out: ''
But a void array won't.
a = np.array([b"\x00\x00"], dtype=np.void)
a
# Out: array([b'\x00\x00'], dtype='|V2')
a[0]
# Out: void(b'\x00\x00')
There is a slight complication: each array element is now wrapped in a void(...), but you can fix that with either:
a[0].item()
# Out: b'\x00\x00'
or for the whole array:
a = a.astype(object)
a
# Out: array([b'\x00\x00'], dtype=object)
a[0]
# Out: b'\x00\x00'
If you replace the line
b_arr_np = np.asarray([b_arr_text], dtype=np.str)
with
b_arr_np = np.asarray([b_arr_text], dtype=np.void).astype(object)
then your example behaves as you expected.
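For reference, here is a minimal round-trip sketch of the same idea. It assumes Python 3, where str() on a bytearray returns its repr text rather than the raw bytes, so the bytes are passed to numpy directly. The short payload is a hypothetical stand-in for the output of to_bytes, not real spacy data.

```python
import numpy as np

# Hypothetical stand-in for the bytes produced by serialising a document.
payload = bytes(bytearray(b"\xca\x00\x00\x00"))

# Fixed-width bytes dtype: trailing zero bytes are dropped when the element is read.
as_bytes = np.array([payload], dtype="S")
trimmed = as_bytes[0]

# Void dtype: the element keeps its full width, trailing zeros included.
as_void = np.array([payload], dtype=np.void)
restored = as_void[0].item()

print(len(trimmed), len(restored))
```

Here restored compares equal to the original payload, while trimmed has lost the trailing zeros.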
Upvotes: 0
Reputation: 2936
This is normal behavior. If you call b_arr_np.tostring(), you can see that all the trailing zeros are still in place.
b_arr = bytearray(b'\xca\x00\x00\x00')
b_arr_text = str(b_arr)
b_arr_np = np.asarray([b_arr_text], dtype=np.str)
b_arr_np
Out[303]:
array(['\xca'],
      dtype='|S4')
b_arr_np.tostring()
Out[304]: '\xca\x00\x00\x00'
See the post "information loss with bytes type" on GitHub. The workarounds are either to end the data with a non-zero byte or to use dtype=uint8 with b_arr:
b_arr_np = np.asarray([b_arr], dtype=np.uint8)
b_arr_np
Out[319]: array([[202, 0, 0, 0]], dtype=uint8)
b_arr_np.tostring()
Out[320]: '\xca\x00\x00\x00'
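The uint8 approach can also be sketched for current NumPy versions, where tobytes() is the replacement for the deprecated tostring(). This is only an illustration under assumptions: the payload is a hypothetical stand-in for real serialised bytes, and np.frombuffer is used here instead of np.asarray to get a flat array of byte values.

```python
import numpy as np

# Hypothetical stand-in for the bytes produced by serialising a document.
payload = b"\xca\x00\x00\x00"

# Interpret the raw bytes as unsigned 8-bit integers; no string trimming applies.
as_uint8 = np.frombuffer(payload, dtype=np.uint8)

# tobytes() returns the underlying buffer, including the trailing zeros.
restored = as_uint8.tobytes()
```

Because each byte is stored as a number, the zeros are ordinary values and survive the round trip.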
Upvotes: 4