Jim Pivarski

Reputation: 5974

How to make a pyarrow.DictionaryArray with ExtensionType? Using from_buffers? Using cast?

Ultimately, my goal is to make a pyarrow.DictionaryArray with an ExtensionType.

For all other kinds of Arrow arrays, I can construct them with the Array.from_buffers static method, passing the ExtensionType as its first argument. However, I don't see a way to use from_buffers on a DictionaryArray: I would need to pass its dictionary, but a DictionaryArray takes zero children, so there is nowhere to put it.

Given a DictionaryArray a, naively calling from_buffers with its type (assuming the dictionary is carried in the type, which I'm pretty sure it's not) crashes the process on a failed check:

>>> import pyarrow as pa
>>> a = pa.array(["one", "two", "three", "two", "one"]).dictionary_encode()
>>> b = pa.DictionaryArray.from_buffers(a.type, len(a), a.indices.buffers())
/arrow/cpp/src/arrow/array/array_dict.cc:83:  Check failed: (data->dictionary) != (nullptr) 
/home/jpivarski/miniconda3/lib/python3.9/site-packages/pyarrow/libarrow.so.600(+0xf17fe8)[0x7f9dc4008fe8]
/home/jpivarski/miniconda3/lib/python3.9/site-packages/pyarrow/libarrow.so.600(_ZN5arrow4util8ArrowLogD1Ev+0xed)[0x7f9dc400978d]
/home/jpivarski/miniconda3/lib/python3.9/site-packages/pyarrow/libarrow.so.600(_ZN5arrow15DictionaryArrayC2ERKSt10shared_ptrINS_9ArrayDataEE+0x11f)[0x7f9dc427d00f]
/home/jpivarski/miniconda3/lib/python3.9/site-packages/pyarrow/libarrow.so.600(_ZN5arrow9MakeArrayERKSt10shared_ptrINS_9ArrayDataEE+0x2c6)[0x7f9dc41381b6]
/home/jpivarski/miniconda3/lib/python3.9/site-packages/pyarrow/lib.cpython-39-x86_64-linux-gnu.so(+0x213a77)[0x7f9dc5504a77]
python(+0x15f995)[0x55b45da91995]
python(_PyObject_MakeTpCall+0x316)[0x55b45da783d6]
python(_PyEval_EvalFrameDefault+0x52de)[0x55b45db162ce]
python(+0x138e20)[0x55b45da6ae20]
python(_PyEval_EvalCodeWithName+0x47)[0x55b45db4f977]
python(PyEval_EvalCodeEx+0x39)[0x55b45db4f9b9]
python(PyEval_EvalCode+0x1b)[0x55b45db4f9db]
python(+0x2506c9)[0x55b45db826c9]
python(+0x28b994)[0x55b45dbbd994]
python(+0x1142bf)[0x55b45da462bf]
python(PyRun_InteractiveLoopFlags+0xeb)[0x55b45da4646a]
python(+0x11487a)[0x55b45da4687a]
python(+0x114e14)[0x55b45da46e14]
python(Py_BytesMain+0x39)[0x55b45dbc4329]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f9dc6abd0b3]
python(+0x20aa51)[0x55b45db3ca51]
Aborted (core dumped)

I didn't think that would work, anyway.

For completeness, I also thought that there might be casting rules from each storage type to an ExtensionType with that storage type, but no, cast doesn't work, either.

>>> import json
>>> import pyarrow as pa
>>> a = pa.array(["one", "two", "three", "two", "one"]).dictionary_encode()
>>> 
>>> class AnnotatedType(pa.ExtensionType):
...     def __init__(self, storage_type, annotation):
...         self.annotation = annotation
...         super().__init__(storage_type, "my:app")
...     def __arrow_ext_serialize__(self):
...         return json.dumps(self.annotation).encode()
...     @classmethod
...     def __arrow_ext_deserialize__(cls, storage_type, serialized):
...         annotation = json.loads(serialized.decode())
...         return cls(storage_type, annotation)
...     @property
...     def num_buffers(self):
...         return self.storage_type.num_buffers
...     @property
...     def num_fields(self):
...         return self.storage_type.num_fields
... 
>>> b = a.cast(AnnotatedType(a.type, {"some": "data"}))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/array.pxi", line 825, in pyarrow.lib.Array.cast
  File "/home/jpivarski/miniconda3/lib/python3.9/site-packages/pyarrow/compute.py", line 309, in cast
    return call_function("cast", [arr], options)
  File "pyarrow/_compute.pyx", line 528, in pyarrow._compute.call_function
  File "pyarrow/_compute.pyx", line 327, in pyarrow._compute.Function.call
  File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 120, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Unsupported cast from dictionary<values=string, indices=int32, ordered=0> to extension<my:app<AnnotatedType>> (no available cast function for target type)

Can DictionaryArrays have ExtensionType? (Calling dictionary_encode on an array with ExtensionType is not an option. My DictionaryArrays will come fully built; I'm hoping to rebuild them with a new type without expanding them out.)

Upvotes: 1

Views: 1233

Answers (1)

joris

Reputation: 139162

The crash in DictionaryArray.from_buffers seems to be a bug. I opened https://issues.apache.org/jira/browse/ARROW-14495, and I think it can actually be fixed so that this works.

For a DictionaryArray specifically, though, there is an alternative constructor that can be used here: DictionaryArray.from_arrays.

Using your example AnnotatedType extension type, let's first create a small array of this type for the DictionaryArray's dictionary:

>>> dictionary = pa.array(["one", "two", "three"], pa.string())
>>> dictionary_ext = pa.ExtensionArray.from_storage(AnnotatedType(pa.string(), "annotation"), dictionary)

Now we can use this dictionary to create the DictionaryArray, together with indices:

>>> arr = pa.DictionaryArray.from_arrays(pa.array([0,1,2,0,1]), dictionary_ext)
>>> arr
<pyarrow.lib.DictionaryArray object at 0x7f7b0e630c80>

-- dictionary:
  [
    "one",
    "two",
    "three"
  ]
-- indices:
  [
    0,
    1,
    2,
    0,
    1
  ]

>>> arr.type
DictionaryType(dictionary<values=extension<my:app<AnnotatedType>>, indices=int64, ordered=0>)

Upvotes: 3
