Reputation: 322
I decided to familiarize myself with the Arrow package. I figured it would be a good
idea to run one of its usage examples (https://github.com/apache/arrow/tree/master/python/examples/minimal_build), using these commands:
docker build -t arrow_ubuntu_minimal -f Dockerfile.ubuntu .
docker run --rm -t -i -v $PWD:/io arrow_ubuntu_minimal /io/build_venv.sh
Unfortunately, after running the latter command, the console prints:
E ModuleNotFoundError: No module named 'pyarrow._dataset'
pyarrow/dataset.py:23: ModuleNotFoundError
====================================================================================== warnings summary ======================================================================================
pyarrow/tests/test_serialization.py:283
/root/arrow/python/pyarrow/tests/test_serialization.py:283: PytestDeprecationWarning: @pytest.yield_fixture is deprecated.
Use @pytest.fixture instead; they are the same.
@pytest.yield_fixture(scope='session')
pyarrow/tests/test_pandas.py::TestConvertListTypes::test_infer_lists
pyarrow/tests/test_pandas.py::TestConvertListTypes::test_to_list_of_structs_pandas
pyarrow/tests/test_pandas.py::TestConvertListTypes::test_nested_large_list
/root/venv/lib/python3.6/site-packages/pandas/core/dtypes/missing.py:475: DeprecationWarning: elementwise comparison failed; this will raise an error in the future.
if np.any(np.asarray(left_value != right_value)):
pyarrow/tests/test_pandas.py::TestConvertListTypes::test_nested_large_list
/root/venv/lib/python3.6/site-packages/pandas/core/dtypes/missing.py:475: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
if np.any(np.asarray(left_value != right_value)):
-- Docs: https://docs.pytest.org/en/stable/warnings.html
================================================================================== short test summary info ===================================================================================
FAILED pyarrow/tests/parquet/test_dataset.py::test_write_to_dataset_filesystem - ModuleNotFoundError: No module named 'pyarrow._dataset'
============================================================ 1 failed, 3168 passed, 689 skipped, 16 xfailed, 5 warnings in 48.01s ============================================================
marcin@marcin-G3-3579:
Has anyone run into a similar issue, or does anyone have an idea how to fix it?
I am currently using Ubuntu 20.04. That might be causing the problem, since the example is set up for Ubuntu 18.04, but I see no way of checking it.
Upvotes: 1
Views: 1367
Reputation: 43887
That looks like a bug in the minimal example. Feel free to file a JIRA.
The Arrow C++ package has a number of feature flags that can be turned on (to enable functionality) or off (to speed up build time and reduce dependencies). Python tests that depend on a particular feature should check whether that feature was built and skip themselves if it was not. This test does not do that.
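To make the "check and skip" idea concrete, here is a minimal sketch of a test that guards itself against a missing optional module. It is not pyarrow's actual mechanism: the helper optional_import and the class DatasetTests are hypothetical names, and unittest is used only so the sketch stays within the standard library.

```python
import importlib
import unittest


def optional_import(name):
    """Return the imported module, or None if it cannot be imported."""
    try:
        return importlib.import_module(name)
    except ImportError:
        # ModuleNotFoundError is a subclass of ImportError, so a build
        # without the feature ends up here.
        return None


# pyarrow.dataset only imports cleanly when the compiled
# pyarrow._dataset extension was built.
ds = optional_import("pyarrow.dataset")


@unittest.skipIf(ds is None, "pyarrow was built without dataset support")
class DatasetTests(unittest.TestCase):
    def test_module_is_usable(self):
        # Runs only when the import above succeeded.
        self.assertTrue(hasattr(ds, "dataset"))
```

With a guard like this, a build without the feature reports the test as skipped rather than failed.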
In the meantime you can either ignore the test failure, change the test to skip (I think this is adding @pytest.mark.dataset above the test name), or add datasets to your C++ build (probably my preferred option).
To add datasets to your C++ build you can add -DARROW_DATASET=ON (next to -DARROW_PARQUET=ON) in build_venv.sh.
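After rebuilding, one quick way to confirm that the dataset extension made it into the build is to ask Python whether it can locate the module. This is only a sketch: has_module is a hypothetical helper, and it assumes the script is run with the Python interpreter from the venv that build_venv.sh created.

```python
import importlib.util


def has_module(name):
    """Report whether `name` can be located, without importing it fully."""
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package (here: pyarrow itself) is missing.
        return False


if __name__ == "__main__":
    # pyarrow._dataset is the module named in the error message;
    # pyarrow._parquet is the equivalent extension for -DARROW_PARQUET=ON.
    for mod in ("pyarrow._parquet", "pyarrow._dataset"):
        print(mod, "available" if has_module(mod) else "missing")
```

If pyarrow._dataset is still reported as missing, the flag most likely did not reach the C++ build or the Python extension was not rebuilt afterwards.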
Upvotes: 4