Pandas DataFrame with categorical columns from a Parquet file using read_parquet?

Question

I am converting large CSV files into Parquet files for further analysis. I read the CSV data into pandas, specifying the column dtypes as follows:

import pandas as pd

_dtype = {"column_1": "float64",
          "column_2": "category",
          "column_3": "int64",
          "column_4": "int64"}

df = pd.read_csv("data.csv", dtype=_dtype)

I then do some more data cleaning and write the data out to Parquet for downstream use:

_parquet_kwargs = {"engine": "pyarrow",
                   "compression": "snappy",
                   "index": False}

df.to_parquet("data.parquet", **_parquet_kwargs)

But when I read the data back into pandas for further analysis using read_parquet, I cannot seem to recover the category dtypes. The following

df = pd.read_parquet("data.parquet")

results in a DataFrame with object dtypes in place of the desired category.

The following seems to work as expected:

import pyarrow.parquet as pq

_table = (pq.ParquetFile("data.parquet")
            .read(use_pandas_metadata=True))

df = _table.to_pandas(strings_to_categorical=True)

However, I would like to know how this can be done using pd.read_parquet.

Marc Garcia · Accepted Answer

This is fixed in Arrow 0.15. The following code now keeps the columns as categories (and performance is significantly better):

import pandas

df = pandas.DataFrame({'foo': list('aabbcc'),
                       'bar': list('xxxyyy')}).astype('category')

df.to_parquet('my_file.parquet')
df = pandas.read_parquet('my_file.parquet')
df.dtypes
