Russell Burdt

Reputation: 2673

Python error using pyarrow - ArrowNotImplementedError: Support for codec 'snappy' not built

I am using Python, Parquet, and Spark, and running into ArrowNotImplementedError: Support for codec 'snappy' not built after upgrading to pyarrow=3.0.0. My previous version without this error was pyarrow=0.17. The error does not appear in pyarrow=1.0.1 but does appear in pyarrow=2.0.0. The idea is to write a pandas DataFrame as a Parquet dataset (on Windows) using snappy compression, and later to process the Parquet dataset using Spark.

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# small DataFrame with a partition column 'x' and two value columns
df = pd.DataFrame({
    'x': [0, 0, 0, 1, 1, 1],
    'a': np.random.random(6),
    'b': np.random.random(6)})
table = pa.Table.from_pandas(df, preserve_index=False)

# write a partitioned Parquet dataset; snappy is pyarrow's default
# compression, and flavor='spark' keeps the schema Spark-compatible
pq.write_to_dataset(table, root_path=r'c:/data', partition_cols=['x'], flavor='spark')

[Screenshot: traceback ending in ArrowNotImplementedError: Support for codec 'snappy' not built]
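For diagnosis, this check reproduces the problem without writing anything, assuming a pyarrow version recent enough to expose pa.Codec.is_available:

import pyarrow as pa

# report the installed version and whether the build includes the snappy
# codec; on the broken install the second line prints False
print(pa.__version__)
print(pa.Codec.is_available('snappy'))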

Upvotes: 16

Views: 25240

Answers (5)

Reddspark

Reputation: 7567

I managed to get it to work by running pip install pyarrow from the Conda prompt.
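That is, from the Anaconda Prompt with your environment active:

pip install pyarrow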

Upvotes: 0

clg4

Reputation: 2953

I had the exact same issue. I did a fresh install of Anaconda 3.8, then ran conda install -c conda-forge pyarrow as described at https://anaconda.org/conda-forge/pyarrow. The install chokes on the frozen/flexible solve, and conda keeps trying different variants until it finally installs. You can then import pyarrow. But when you try to open a Parquet file, you get the 'snappy' codec error - the subject of this thread.

I then ran conda remove pyarrow so I was back to a clean install. Then pip install pyarrow, and I could successfully load the Parquet file.
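To confirm the reinstall fixed it, a minimal round-trip with snappy compression (the file name here is just an example):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# write and read back a tiny table with snappy compression; this raises
# ArrowNotImplementedError on a build without snappy support
table = pa.Table.from_pandas(pd.DataFrame({'a': [1, 2, 3]}))
pq.write_table(table, 'snappy_check.parquet', compression='snappy')
print(pq.read_table('snappy_check.parquet').to_pandas())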

Upvotes: 1

Michel K

Reputation: 701

Something is wrong with the conda install pyarrow method. I removed it with conda remove pyarrow and then installed it with pip install pyarrow. That ended up working.

Upvotes: 13

Pace

Reputation: 43817

The pyarrow package you had installed did not come from conda-forge, and it does not appear to match the package on PyPI. I did a bit more research: pypi_0 just means the package was installed via pip. It does not mean it actually came from PyPI.

I'm not really sure how this happened. You could check your conda log (envs/YOUR-ENV/conda-meta/history), but given that this was installed outside of conda, I'm not sure there will be any meaningful information in there. Perhaps you tried to install Arrow after the version was bumped to 3 and before the wheels were uploaded, so your system fell back to building from source?
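For what it's worth, one way to see how a package got into the environment is the build/channel columns of conda list:

conda list pyarrow
# pip-installed packages show up with build "pypi_0" and channel "pypi", e.g.
# pyarrow    3.0.0    pypi_0    pypi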

Upvotes: 10

0x26res

Reputation: 13902

I'm not 100% sure, but it could be because, since version 1.0.0, the default Arrow build has been slimmed down and snappy became an optional component.

I think you would have to rebuild Arrow with -DARROW_WITH_SNAPPY=ON, but this can be quite difficult and tedious to get working.

Another option would be to disable snappy:

pq.write_to_dataset(table, root_path=r'c:/data', partition_cols=['x'], flavor='spark', compression="NONE")
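Spark reads uncompressed Parquet out of the box, so this sidesteps the missing codec entirely, at the cost of larger files on disk.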

Upvotes: -1
