mvbentes

Reputation: 1052

Pandas HDF5 store unicode error on select query

I have unicode data as read from this file:

Mdt,Doccompra,OrgC,Cen,NumP,Criadopor,Dtcriacao,Fornecedor,P,Fun
400,8751215432,2581,,1,MIGRAÇÃO,01.10.2004,75852214,,TD
400,5464282154,9874,,1,MIGRAÇÃO,01.10.2004,78995411,,FO

I have two problems:

  1. When I try to query this unicode data I get a UnicodeDecodeError:

    Traceback (most recent call last):
      File "<ipython-input-1-4423dceb2b1d>", line 1, in <module>
        runfile('C:/Users/u5en/Documents/SAP/Programação/Problema HDF.py', wdir='C:/Users/u5en/Documents/SAP/Programação')
    
      File "C:\Users\u5en\AppData\Local\Continuum\Anaconda3\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 580, in runfile
        execfile(filename, namespace)
    
      File "C:\Users\u5en\AppData\Local\Continuum\Anaconda3\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 48, in execfile
        exec(compile(open(filename, 'rb').read(), filename, 'exec'), namespace)
    
      File "C:/Users/u5en/Documents/SAP/Programação/Problema HDF.py", line 15, in <module>
        store.select("EKKA", "columns=['Mdt', 'Fornecedor']")
    
      File "C:\Users\u5en\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 665, in select
        return it.get_result()
    
      File "C:\Users\u5en\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 1359, in get_result
        results = self.func(self.start, self.stop, where)
    
      File "C:\Users\u5en\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 658, in func
        columns=columns, **kwargs)
    
      File "C:\Users\u5en\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 3968, in read
        if not self.read_axes(where=where, **kwargs):
    
      File "C:\Users\u5en\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 3201, in read_axes
        a.convert(values, nan_rep=self.nan_rep, encoding=self.encoding)
    
      File "C:\Users\u5en\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 2058, in convert
        self.data, nan_rep=nan_rep, encoding=encoding)
    
      File "C:\Users\u5en\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 4359, in _unconvert_string_array
        data = f(data)
    
      File "C:\Users\u5en\AppData\Local\Continuum\Anaconda3\lib\site-packages\numpy\lib\function_base.py", line 1700, in __call__
        return self._vectorize_call(func=func, args=vargs)
    
      File "C:\Users\u5en\AppData\Local\Continuum\Anaconda3\lib\site-packages\numpy\lib\function_base.py", line 1769, in _vectorize_call
        outputs = ufunc(*inputs)
    
      File "C:\Users\u5en\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 4358, in <lambda>
        f = np.vectorize(lambda x: x.decode(encoding), otypes=[np.object])
    
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 7: unexpected end of data
    

How can I store and query my unicode data in HDF5?

  2. I have many tables whose column names I do not know beforehand and which are not valid PyTables natural names (they trigger NaturalNameWarning). I would like the user to be able to query on these columns, so how can I query them when their names prevent it? I have read that this used to have no easy fix; if that is still the case, I will just remove the offending characters from the headings.

    import pandas as pd
    
    dados = pd.read_csv("EKKA - Cópia.csv")
    print(dados)
    # the encoding is passed when opening the store
    store = pd.HDFStore('teste.h5', encoding="utf-8")
    store.append("EKKA", dados, format="table", data_columns=True)
    # this select raises the UnicodeDecodeError shown above
    store.select("EKKA", "columns=['Mdt', 'Fornecedor']")
    store.close()
    

Would I be better off doing this in sqlite?

Environment:

Upvotes: 1

Views: 1027

Answers (1)

Jeff

Reputation: 129018

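Under Python 2.7 on Windows 7 with pandas 0.15.2, everything worked as expected and no encoding was necessary. On Python 3.4, however, I needed the approach below. UTF-8 can represent these characters, but it uses two bytes each for Ç and Ã, and the traceback ("unexpected end of data" at byte 0xc3 in position 7) suggests the stored string was cut off in the middle of a character. A single-byte encoding such as 'latin1' usually solves these issues. Note that I had to read the csv with this encoding in the first place.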

>>> df = pd.read_csv('../../test.csv',encoding='latin1')
>>> df
   Mdt   Doccompra  OrgC  Cen  NumP Criadopor   Dtcriacao  Fornecedor   P Fun
0  400  8751215432  2581  NaN     1  MIGRAÇ\xc3O  01.10.2004    75852214 NaN  TD
1  400  5464282154  9874  NaN     1  MIGRAÇ\xc3O  01.10.2004    78995411 NaN  FO
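
The error message in the question can be reproduced without pandas, which may help explain why 'latin1' behaves differently here. The sketch below is only an illustration: the 8-byte field width is an assumption inferred from the traceback (byte 0xc3 at position 7), not something taken from the pandas source.

```python
value = "MIGRAÇÃO"  # 8 characters, as in the Criadopor column

utf8_bytes = value.encode("utf-8")      # Ç and Ã take two bytes each
latin1_bytes = value.encode("latin1")   # one byte per character

print(len(value), len(utf8_bytes), len(latin1_bytes))  # 8 10 8

# Assumption: the stored field is only as wide as the character count.
width = len(value)

try:
    # Cutting the UTF-8 bytes at 8 splits the two-byte sequence for Ã.
    utf8_bytes[:width].decode("utf-8")
except UnicodeDecodeError as exc:
    error_message = str(exc)
    print(error_message)

# The latin1 bytes fit in the same width and round-trip cleanly.
round_trip = latin1_bytes[:width].decode("latin1")
print(round_trip == value)  # True
```

The printed error matches the last line of the traceback in the question, which is consistent with a truncated UTF-8 value rather than an unrepresentable character.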

Furthermore, the encoding must be specified on the append/put calls, not when opening the store:

>>> df.to_hdf('test.h5','df',format='table',mode='w',data_columns=True,encoding='latin1')

>>> pd.read_hdf('test.h5','df')
   Mdt   Doccompra  OrgC  Cen  NumP Criadopor   Dtcriacao  Fornecedor   P Fun
0  400  8751215432  2581  NaN     1  MIGRAÇ\xc3O  01.10.2004    75852214 NaN  TD
1  400  5464282154  9874  NaN     1  MIGRAÇ\xc3O  01.10.2004    78995411 NaN  FO

Once the data has been written with an encoding, it is not necessary to specify the encoding when reading.
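
On the second point in the question (column names that trigger NaturalNameWarning), the fallback the asker mentions, removing the offending characters from the headings, could look roughly like the sketch below. The helper name `to_natural_name` and the `col_` prefix are hypothetical choices for illustration; the rule applied is simply "ASCII letters, digits and underscores, not starting with a digit, not a Python keyword".

```python
import keyword
import re


def to_natural_name(name):
    """Hypothetical helper: make a column label a plain ASCII identifier."""
    # Replace every character outside [A-Za-z0-9_] with an underscore.
    cleaned = re.sub(r"\W", "_", str(name), flags=re.ASCII)
    # Identifiers cannot be empty, start with a digit, or be a keyword.
    if not cleaned or cleaned[0].isdigit() or keyword.iskeyword(cleaned):
        cleaned = "col_" + cleaned
    return cleaned


headers = ["Mdt", "Dt. criação", "1st col", "class"]
cleaned_headers = [to_natural_name(h) for h in headers]
print(cleaned_headers)
```

With a DataFrame such as `dados` from the question, the same function can be passed to `dados.rename(columns=to_natural_name)` before appending to the store. Distinct headings can collapse to the same cleaned name, so it is worth checking the result for duplicates.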

Upvotes: 1
