Reputation: 5974
Ultimately, my goal is to make a pyarrow.DictionaryArray with an ExtensionType.
For all other kinds of Arrow arrays, I can use the Array.from_buffers static method to construct it and pass the ExtensionType as its first argument. However, I don't see a way to use from_buffers on DictionaryArray because I need to pass its dictionary, but it takes zero children.
Given a DictionaryArray a, naively calling its from_buffers (assuming the dictionary is carried by its type, which I'm pretty sure it's not) crashes the interpreter.
>>> import pyarrow as pa
>>> a = pa.array(["one", "two", "three", "two", "one"]).dictionary_encode()
>>> b = pa.DictionaryArray.from_buffers(a.type, len(a), a.indices.buffers())
/arrow/cpp/src/arrow/array/array_dict.cc:83: Check failed: (data->dictionary) != (nullptr)
/home/jpivarski/miniconda3/lib/python3.9/site-packages/pyarrow/libarrow.so.600(+0xf17fe8)[0x7f9dc4008fe8]
/home/jpivarski/miniconda3/lib/python3.9/site-packages/pyarrow/libarrow.so.600(_ZN5arrow4util8ArrowLogD1Ev+0xed)[0x7f9dc400978d]
/home/jpivarski/miniconda3/lib/python3.9/site-packages/pyarrow/libarrow.so.600(_ZN5arrow15DictionaryArrayC2ERKSt10shared_ptrINS_9ArrayDataEE+0x11f)[0x7f9dc427d00f]
/home/jpivarski/miniconda3/lib/python3.9/site-packages/pyarrow/libarrow.so.600(_ZN5arrow9MakeArrayERKSt10shared_ptrINS_9ArrayDataEE+0x2c6)[0x7f9dc41381b6]
/home/jpivarski/miniconda3/lib/python3.9/site-packages/pyarrow/lib.cpython-39-x86_64-linux-gnu.so(+0x213a77)[0x7f9dc5504a77]
python(+0x15f995)[0x55b45da91995]
python(_PyObject_MakeTpCall+0x316)[0x55b45da783d6]
python(_PyEval_EvalFrameDefault+0x52de)[0x55b45db162ce]
python(+0x138e20)[0x55b45da6ae20]
python(_PyEval_EvalCodeWithName+0x47)[0x55b45db4f977]
python(PyEval_EvalCodeEx+0x39)[0x55b45db4f9b9]
python(PyEval_EvalCode+0x1b)[0x55b45db4f9db]
python(+0x2506c9)[0x55b45db826c9]
python(+0x28b994)[0x55b45dbbd994]
python(+0x1142bf)[0x55b45da462bf]
python(PyRun_InteractiveLoopFlags+0xeb)[0x55b45da4646a]
python(+0x11487a)[0x55b45da4687a]
python(+0x114e14)[0x55b45da46e14]
python(Py_BytesMain+0x39)[0x55b45dbc4329]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f9dc6abd0b3]
python(+0x20aa51)[0x55b45db3ca51]
Aborted (core dumped)
I didn't think that would work, anyway.
For completeness, I also thought that there might be casting rules from each storage type to an ExtensionType with that storage type, but no, cast doesn't work, either.
>>> import json
>>> import pyarrow as pa
>>> a = pa.array(["one", "two", "three", "two", "one"]).dictionary_encode()
>>>
>>> class AnnotatedType(pa.ExtensionType):
...     def __init__(self, storage_type, annotation):
...         self.annotation = annotation
...         super().__init__(storage_type, "my:app")
...     def __arrow_ext_serialize__(self):
...         return json.dumps(self.annotation).encode()
...     @classmethod
...     def __arrow_ext_deserialize__(cls, storage_type, serialized):
...         annotation = json.loads(serialized.decode())
...         return cls(storage_type, annotation)
...     @property
...     def num_buffers(self):
...         return self.storage_type.num_buffers
...     @property
...     def num_fields(self):
...         return self.storage_type.num_fields
...
>>> b = a.cast(AnnotatedType(a.type, {"some": "data"}))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/array.pxi", line 825, in pyarrow.lib.Array.cast
  File "/home/jpivarski/miniconda3/lib/python3.9/site-packages/pyarrow/compute.py", line 309, in cast
    return call_function("cast", [arr], options)
  File "pyarrow/_compute.pyx", line 528, in pyarrow._compute.call_function
  File "pyarrow/_compute.pyx", line 327, in pyarrow._compute.Function.call
  File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 120, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Unsupported cast from dictionary<values=string, indices=int32, ordered=0> to extension<my:app<AnnotatedType>> (no available cast function for target type)
Can DictionaryArrays have ExtensionType? (Calling dictionary_encode on an array with ExtensionType is not an option. My DictionaryArrays will come fully built; I'm hoping to rebuild them with a new type without expanding them out.)
Upvotes: 1
Views: 1233
Reputation: 139162
The crash in DictionaryArray.from_buffers seems to be a bug (I opened https://issues.apache.org/jira/browse/ARROW-14495; I think it can actually be fixed to work).
But specifically for a DictionaryArray, there is an alternative constructor, DictionaryArray.from_arrays, that can be used here.
Using your example AnnotatedType extension type, let's first create a small array of this type for the DictionaryArray's dictionary:
>>> dictionary = pa.array(["one", "two", "three"], pa.string())
>>> dictionary_ext = pa.ExtensionArray.from_storage(AnnotatedType(pa.string(), "annotation"), dictionary)
Now we can use this dictionary to create the DictionaryArray, together with indices:
>>> arr = pa.DictionaryArray.from_arrays(pa.array([0,1,2,0,1]), dictionary_ext)
>>> arr
<pyarrow.lib.DictionaryArray object at 0x7f7b0e630c80>
-- dictionary:
[
"one",
"two",
"three"
]
-- indices:
[
0,
1,
2,
0,
1
]
>>> arr.type
DictionaryType(dictionary<values=extension<my:app<AnnotatedType>>, indices=int64, ordered=0>)
Upvotes: 3