Cenoc

Reputation: 11662

HDFStore error on append with pandas

I get the following error:

    exportStore.append(key, hdfStoreLocal, index = False, data_columns = True)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/io/pytables.py", line 911, in append
    **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/io/pytables.py", line 1270, in _write_to_group
    s.write(obj=value, append=append, complib=complib, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/io/pytables.py", line 3605, in write
    **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/io/pytables.py", line 3293, in create_axes
    raise e
ValueError: invalid itemsize in generic type tuple

Any ideas on why this would happen? It's a rather large project, so I'm not sure what code I can offer, but this happens on the first append. Any help would be very much appreciated.

EDIT:

Output of pd.show_versions():

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-35-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.14.1
nose: None
Cython: 0.20.2
numpy: 1.8.1
scipy: 0.13.3
statsmodels: None
IPython: 1.2.1
sphinx: 1.2.2
patsy: None
scikits.timeseries: None
dateutil: 1.5
pytz: 2012c
bottleneck: None
tables: 3.1.1
numexpr: 2.2.2
matplotlib: 1.3.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
httplib2: 0.8
apiclient: None
rpy2: None
sqlalchemy: None
pymysql: None
psycopg2: None

Output of df.info():

<class 'pandas.core.frame.DataFrame'>
Int64Index: 61500 entries, 0 to 61499
Data columns (total 48 columns):
Sequential_Code_1        61500 non-null float64
Age_1                    61500 non-null float64
Sex_1                    61500 non-null object
Race_1                   61500 non-null object
Ethnicity_1              61500 non-null object
Principal_Code_1         61500 non-null object
Admitting_Code_1         61500 non-null object
Principal_Code_2         61500 non-null object
Other_Codes_1            61500 non-null object
Other_Codes_2            61500 non-null object
Other_Codes_3            61500 non-null object
Other_Codes_4            61500 non-null object
Other_Codes_5            61500 non-null object
Other_Codes_6            61500 non-null object
Other_Codes_7            61500 non-null object
Other_Codes_8            61500 non-null object
Other_Codes_9            61500 non-null object
Other_Codes_10           61500 non-null object
Other_Codes_11           61500 non-null object
Other_Codes_12           61500 non-null object
Other_Codes_13           61500 non-null object
Other_Codes_14           61500 non-null object
Other_Codes_15           61500 non-null object
Other_Codes_16           61500 non-null object
Other_Codes_17           61500 non-null object
Other_Codes_18           61500 non-null object
Other_Codes_19           61500 non-null object
Other_Codes_20           61500 non-null object
Other_Codes_21           61500 non-null object
Other_Codes_22           61500 non-null object
Other_Codes_23           61500 non-null object
Other_Codes_24           61500 non-null object
External_Code_1          61500 non-null object
Place_Code_1             61500 non-null object

Output of df.head():

       Sequential_Number_1  Age_1 Sex_1 Race_1  \
1128                   2.000000e+13     73             F             01   
2185                   2.000000e+13     52             M             01   
2202                   2.000000e+13     64             M             01   
2283                   2.000000e+13     72             F             01   
4471                   2.000000e+13     62             F             01 

Upvotes: 2

Views: 792

Answers (1)

Jeff

Reputation: 128948

The problem is that you need to specify a min_itemsize; see the docs here.

This controls the allocated width of string-like columns. If none of the passed values has any length at all (i.e., they are all zero-length strings), the append fails (the error message could probably be better). Otherwise the column width is inferred from the longest of the passed values.

The reason to specify this yourself is that you may be appending in multiple chunks: chunk 2 could contain a longer string, which means the column needs to be at least that wide, but looking only at chunk 1 doesn't tell you this.
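For example, a minimal sketch of passing min_itemsize when appending in chunks (the file name and data here are made up for illustration; only Other_Codes_1 comes from the question's columns):

    import pandas as pd

    # hypothetical chunks; a longer string only appears in the second one
    chunk1 = pd.DataFrame({'Other_Codes_1': ['A1', 'B22']})
    chunk2 = pd.DataFrame({'Other_Codes_1': ['C333D444']})

    store = pd.HDFStore('example.h5')

    # reserve room for strings up to 20 characters, so values longer than
    # anything seen in chunk1 can still be appended later
    store.append('df', chunk1, data_columns=True,
                 min_itemsize={'Other_Codes_1': 20})
    store.append('df', chunk2, data_columns=True)

    store.close()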

Further, I would pre-process this data so it contains no zero-length strings; instead use np.nan as the missing value, which HDFStore / pandas handle properly.
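For instance, a small sketch (again with a made-up file name and example data) of converting zero-length strings to np.nan before appending:

    import numpy as np
    import pandas as pd

    # hypothetical frame where empty strings stand in for missing values
    df = pd.DataFrame({'Sex_1': ['F', '', 'M'],
                       'Race_1': ['01', '02', '']})

    # treat zero-length strings as missing; HDFStore / pandas handle NaN properly
    df = df.replace('', np.nan)

    store = pd.HDFStore('example.h5')
    store.append('df', df, data_columns=True,
                 min_itemsize={'Sex_1': 5, 'Race_1': 5})
    store.close()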

Upvotes: 1
