KerikoN

Reputation: 36

How to store and load multi-column index pandas dataframes with parquet

I have a dataset similar to:

initial_df = pd.DataFrame([{'a': 0, 'b': 0, 'c': 10.898}, {'a': 0, 'b': 1, 'c': 1.88}, {'a': 1, 'b': 0, 'c': 108.1}, {'a': 1, 'b': 1, 'c': 10.898}])
initial_df.set_index(['a', 'b'], inplace=True)

I am able to store it just fine (append=False plays no role in this example, but in the actual code it is set from a variable):

initial_df.to_parquet('test.parquet', engine='fastparquet', compression='GZIP', append=False, index=True)

I am also able to load it completely fine:

read_df = pd.read_parquet('test.parquet', engine='fastparquet')
read_df

This is how the dataset looks:

[screenshot: data in the dataframe]

[screenshot: dataframe.info() output]
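
In text form, read_df prints roughly like this (a MultiIndex on 'a' and 'b' with a single float64 column 'c'):

           c
a b
0 0   10.898
  1    1.880
1 0  108.100
  1   10.898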

But this is where the issue begins. In my application I will have to append new dataframes to existing files; the first index level ('a' in this example) keeps incrementing, while the second level ('b') loops over the same values.

additional_df = pd.DataFrame([{'a': 2, 'b': 0, 'c': 10.898}, {'a': 2, 'b': 1, 'c': 1.88}, {'a': 3, 'b': 0, 'c': 108.1}, {'a': 3, 'b': 1, 'c': 10.898}])
additional_df.set_index(['a', 'b'], inplace=True)

I then store this additional data with:

additional_df.to_parquet('test.parquet', engine='fastparquet', compression='GZIP', append=True, index=True)

When I try to retrieve it with:

read_df = pd.read_parquet('test.parquet', engine='fastparquet')

I get the following error (raised from pandas\io\parquet.py:358):

RuntimeError: Different dictionaries encountered while building categorical

VERSIONS:
python: 3.10.8
pandas: 1.5.1
fastparquet: 0.8.3 (also tested with older 0.5.0)
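
For completeness, here are the snippets above combined into a single script that reproduces the error:

import pandas as pd

# initial write
initial_df = pd.DataFrame([{'a': 0, 'b': 0, 'c': 10.898}, {'a': 0, 'b': 1, 'c': 1.88}, {'a': 1, 'b': 0, 'c': 108.1}, {'a': 1, 'b': 1, 'c': 10.898}])
initial_df.set_index(['a', 'b'], inplace=True)
initial_df.to_parquet('test.parquet', engine='fastparquet', compression='GZIP', append=False, index=True)

# append rows with new values for index level 'a'
additional_df = pd.DataFrame([{'a': 2, 'b': 0, 'c': 10.898}, {'a': 2, 'b': 1, 'c': 1.88}, {'a': 3, 'b': 0, 'c': 108.1}, {'a': 3, 'b': 1, 'c': 10.898}])
additional_df.set_index(['a', 'b'], inplace=True)
additional_df.to_parquet('test.parquet', engine='fastparquet', compression='GZIP', append=True, index=True)

# reading the appended file raises:
# RuntimeError: Different dictionaries encountered while building categorical
read_df = pd.read_parquet('test.parquet', engine='fastparquet')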

I tried debugging the source code to better understand why the RuntimeError is raised. The only thing I was able to figure out is that the read_col function at fastparquet\core.py:170 is called multiple times for each column, which causes the index to be written twice as often as required, and the error is raised on the second attempt to write it.

I also played around with the index parameter of read_parquet, but I do not believe that this is causing the issue.
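
For reference, what I tried was along these lines (the extra keyword should be passed through to fastparquet's to_pandas):

read_df = pd.read_parquet('test.parquet', engine='fastparquet', index=False)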

Upvotes: 2

Views: 915

Answers (1)

KerikoN

Reputation: 36

I have not really solved the specific problem I had and would still appreciate any input anyone has, but I was able to work around it using a method suggested by a friend.

Instead of appending to one file, I am now using a directory of files, each with the same DataFrame structure. The functions I had problems with were replaced as follows:

  • Appending --> just write a new file to the output directory (each unique and/or separate DataFrame structure should have its own directory).

pd.to_parquet("./directory/new_file.parquet", engine='pyarrow', compression='gzip', index=True)

  • Read all data together --> just read the directory (all DataFrames in the directory will be merged; they must have the same structure!)

pd.read_parquet("./directory", engine='pyarrow')

Also, I am now using pyarrow as the engine instead of fastparquet.
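
Putting it all together, a minimal sketch of the workaround (the directory, file names, and the append_frame helper are just illustrative):

import os
import pandas as pd

out_dir = './directory'  # one directory per DataFrame structure
os.makedirs(out_dir, exist_ok=True)

def append_frame(df, name):
    # "appending" now means writing each new chunk as its own parquet file
    df.to_parquet(os.path.join(out_dir, f'{name}.parquet'), engine='pyarrow', compression='gzip', index=True)

append_frame(initial_df, 'initial')
append_frame(additional_df, 'additional')

# reading the directory merges all files (they must share the same structure)
read_df = pd.read_parquet(out_dir, engine='pyarrow')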

Upvotes: 0
