Reputation: 1428
I have a large pandas DataFrame with a string column whose string lengths are highly skewed. Most rows hold strings shorter than 20 characters, but some rows hold strings longer than 2000 characters.
I store this DataFrame on disk using pandas.HDFStore.append and set min_itemsize = 4000. However, this approach is highly inefficient: the HDF5 file is very large, and we know that most of it is empty.
Is it possible to assign different sizes for the rows of this string column? That is, assign small min_itemsize to rows whose string is short, and assign large min_itemsize to rows whose string is long.
Upvotes: 4
Views: 4193
Reputation: 128948
When using HDFStore to store strings, the maximum length of a string in the column becomes the width for that particular column. This can be customized; see here.
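As a rough illustration of that width setting, here is a minimal sketch that appends several frames to one table while reserving a fixed number of bytes for column 'A' through min_itemsize. The helper name append_with_itemsize and the key "df" are made up for this example, and it assumes PyTables is installed.

```python
import pandas as pd


def append_with_itemsize(path, frames, itemsize):
    """Append each frame to one table, reserving `itemsize` bytes per string in column 'A'.

    Hypothetical helper for illustration; requires PyTables.
    """
    with pd.HDFStore(path, mode="w") as store:
        for frame in frames:
            # The width is fixed when the table is first created, so it has to be
            # large enough for the longest string that will ever be appended.
            store.append("df", frame, min_itemsize={"A": itemsize})
    return pd.read_hdf(path, "df")
```

Calling it with a frame of short strings followed by a frame containing one 4000-character string should work only because the width was reserved up front; every row then occupies the full reserved width on disk.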
Several options are available to handle various cases. Compression can help a lot.
In [6]: df = DataFrame({'A' : ['too']*10000})
In [7]: df.iloc[-1] = 'A'*4000
In [8]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 0 to 9999
Data columns (total 1 columns):
A 10000 non-null object
dtypes: object(1)
memory usage: 156.2+ KB
These are fixed stores. The strings are stored as object types, so it's not particularly performant; nor are these stores available for query / appends.
In [9]: df.to_hdf('test_no_compression_fixed.h5','df',mode='w',format='fixed')
In [10]: df.to_hdf('test_no_compression_table.h5','df',mode='w',format='table')
Table stores are quite flexible, but force a fixed size on the storage.
In [11]: df.to_hdf('test_compression_fixed.h5','df',mode='w',format='fixed',complib='blosc')
In [12]: df.to_hdf('test_compression_table.h5','df',mode='w',format='table',complib='blosc')
Generally using a categorical representation provides run-time and storage efficiency.
In [13]: df['A'] = df['A'].astype('category')
In [14]: df.to_hdf('test_categorical_table.h5','df',mode='w',format='table')
In [15]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 0 to 9999
Data columns (total 1 columns):
A 10000 non-null category
dtypes: category(1)
memory usage: 87.9 KB
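To see why the categorical form is compact, the sketch below rebuilds the same example frame and inspects the category dtype: each distinct string is kept once, and every row holds only a small integer code. This only looks at the in-memory representation and makes no claim about the exact on-disk layout.

```python
import pandas as pd

# Same shape as the example: 9999 short strings and one 4000-character string.
df = pd.DataFrame({"A": ["too"] * 10000})
df.loc[df.index[-1], "A"] = "A" * 4000

cat = df["A"].astype("category")

# Each distinct string is stored once ...
print(len(cat.cat.categories))  # 2
# ... and each row is an integer code pointing at one of them.
print(cat.cat.codes.dtype)      # int8
```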
In [18]: ls -ltr *.h5
-rw-rw-r-- 1162080 Aug 31 06:36 test_no_compression_fixed.h5
-rw-rw-r-- 1088361 Aug 31 06:39 test_compression_fixed.h5
-rw-rw-r-- 40179679 Aug 31 06:36 test_no_compression_table.h5
-rw-rw-r-- 259058 Aug 31 06:39 test_compression_table.h5
-rw-rw-r-- 339281 Aug 31 06:37 test_categorical_table.h5
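The size of the uncompressed table file can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes each row stores the full 4000-byte string width plus one 8-byte index value; the remainder of the file is taken to be HDF5 overhead, which is an assumption rather than something measured here.

```python
n_rows = 10_000
itemsize = 4_000       # bytes reserved per string: the longest value, 'A' * 4000
index_bytes = 8        # assumption: one 64-bit integer index value per row

reserved = n_rows * (itemsize + index_bytes)
observed = 40_179_679  # size of test_no_compression_table.h5 in the listing above

# Bytes of actual string data: 9999 copies of 'too' plus one 4000-character string.
payload = (n_rows - 1) * len("too") + itemsize

print(reserved)             # 40080000
print(observed - reserved)  # 99679
print(payload)              # 33997
```

Under these assumptions, well under 1% of the reserved string space holds real data, which is why the compressed table is so much smaller.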
Upvotes: 6