Tyler Acorn

Reputation: 75

Using blosc compression in pandas causes heap corruption

I have been using pandas for a while, but I am new to HDF5, so I am trying to learn it and convert some of my research data files to HDF5. I've looked through a number of SO posts about Python and HDF5, and I am interested in using the Blosc compression algorithm (we do a lot of calculations with the data sets, so read/write speed is a higher priority than storage size).

When using pandas.to_hdf, I have run into issues with the blosc compression library. When I use blosc, Python crashes, and when I open the debugger in Visual Studio 2010 I get

Unhandled exception at 0x00007ffcd59fa28c in python.exe: 0xC0000374: A heap has been corrupted.

I have set up a separate example in a script and get the same issue:

import numpy as np
import pandas as pd

test = pd.DataFrame()
test['random1'] = np.random.randn(1000000)
test['random2'] = np.random.randn(1000000)
test['random3'] = np.random.randn(1000000)

# Write out a csv first to compare file sizes
test.to_csv('./examples/data/random_3c.csv')

# Write out using different compression algorithms to compare
test.to_hdf('./examples/data/random_3c_zlib.h5',
            key='Random_3Col', mode='w', format='table', 
            append=False, complevel=9, complib='zlib', fletcher32=True)

test.to_hdf('./examples/data/random_3c_blosc.h5',
            key='Random_3Col', mode='w', format='table', 
            append=False, complevel=9, complib='blosc', fletcher32=True)

The CSV writes out fine (file size of 65,217 KB).
The zlib compression writes out fine (file size of 21,719 KB).
The blosc compression crashes the kernel, and I get a heap corruption message when I open the debugger in VS.
My pandas version is 0.16.2.
My PyTables version is 3.2.0.
I have also installed HDF5 from The HDF Group.
And I'm working on a Windows machine.

At this point I'm not even really sure how to start tracking down what's causing the crash. Any suggestions, or has anyone seen this before? I found some cases on SO of people having issues when trying to use an external blosc library, but I haven't come close to touching that yet. I figure I'll get the basics working first! As far as I know, pandas uses PyTables, which comes bundled with a version of blosc.
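For anyone trying to narrow this down, a reasonable first step is to record exactly which builds are being loaded. The following is a minimal sketch, not a fix: `collect_versions` is a hypothetical helper name, and it assumes the installed PyTables exposes `tables.which_lib_version`, which returns `None` when a library is not available.

```python
import platform
import sys


def collect_versions():
    """Return the versions of the pieces involved in an HDF5 write."""
    info = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }
    try:
        import pandas as pd
        info["pandas"] = pd.__version__
    except ImportError:
        info["pandas"] = None
    try:
        import tables
        info["pytables"] = tables.__version__
        # Each call returns a tuple describing the C library, or None
        # if PyTables was built without it.
        info["blosc"] = tables.which_lib_version("blosc")
        info["hdf5"] = tables.which_lib_version("hdf5")
    except ImportError:
        info["pytables"] = None
    return info


if __name__ == "__main__":
    for name, value in collect_versions().items():
        print(name, value)
```

Including this output in a bug report makes it easier to tell whether the bundled blosc or the HDF5 build is the mismatched piece.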

Thanks!

Upvotes: 3

Views: 760

Answers (1)

xgdgsc

Reputation: 1367

If you are using the Anaconda distribution, this is a package-building issue: Pytables 3.2, python 3.4 under windows x64 · Issue #458 · ContinuumIO/anaconda-issues. You can watch that issue and wait for the fix.
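To check whether an updated package has resolved the problem without crashing your main session, one option is to attempt the write in a child process and inspect its exit code. This is only a sketch under assumptions: `blosc_write_works` and the `CHILD` script are hypothetical names, the probe writes a small 1,000-row frame rather than the original data, and it omits `fletcher32` to stay minimal.

```python
import os
import subprocess
import sys
import tempfile

# Script run in the child process; the output path arrives as sys.argv[1].
CHILD = """
import sys
import numpy as np
import pandas as pd
df = pd.DataFrame({'x': np.random.randn(1000)})
df.to_hdf(sys.argv[1], key='probe', mode='w', format='table',
          complevel=9, complib='blosc')
"""


def blosc_write_works():
    """Return True if a blosc-compressed write succeeds in a child process."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "probe.h5")
        # A crash or an import error in the child gives a non-zero exit code
        # instead of taking down the calling interpreter.
        returncode = subprocess.call([sys.executable, "-c", CHILD, path])
    return returncode == 0


if __name__ == "__main__":
    print("blosc write ok:", blosc_write_works())
```

If this prints `False`, falling back to `complib='zlib'`, which the question reports as working, is one way to keep going until the package is fixed.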

Upvotes: 1
