Reputation: 1052
I have unicode data as read from this file:
Mdt,Doccompra,OrgC,Cen,NumP,Criadopor,Dtcriacao,Fornecedor,P,Fun
400,8751215432,2581,,1,MIGRAÇÃO,01.10.2004,75852214,,TD
400,5464282154,9874,,1,MIGRAÇÃO,01.10.2004,78995411,,FO
I have two problems:
When I try to query this unicode data I get a UnicodeDecodeError
:
Traceback (most recent call last):
File "<ipython-input-1-4423dceb2b1d>", line 1, in <module>
runfile('C:/Users/u5en/Documents/SAP/Programação/Problema HDF.py', wdir='C:/Users/u5en/Documents/SAP/Programação')
File "C:\Users\u5en\AppData\Local\Continuum\Anaconda3\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 580, in runfile
execfile(filename, namespace)
File "C:\Users\u5en\AppData\Local\Continuum\Anaconda3\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 48, in execfile
exec(compile(open(filename, 'rb').read(), filename, 'exec'), namespace)
File "C:/Users/u5en/Documents/SAP/Programação/Problema HDF.py", line 15, in <module>
store.select("EKKA", "columns=['Mdt', 'Fornecedor']")
File "C:\Users\u5en\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 665, in select
return it.get_result()
File "C:\Users\u5en\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 1359, in get_result
results = self.func(self.start, self.stop, where)
File "C:\Users\u5en\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 658, in func
columns=columns, **kwargs)
File "C:\Users\u5en\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 3968, in read
if not self.read_axes(where=where, **kwargs):
File "C:\Users\u5en\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 3201, in read_axes
a.convert(values, nan_rep=self.nan_rep, encoding=self.encoding)
File "C:\Users\u5en\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 2058, in convert
self.data, nan_rep=nan_rep, encoding=encoding)
File "C:\Users\u5en\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 4359, in _unconvert_string_array
data = f(data)
File "C:\Users\u5en\AppData\Local\Continuum\Anaconda3\lib\site-packages\numpy\lib\function_base.py", line 1700, in __call__
return self._vectorize_call(func=func, args=vargs)
File "C:\Users\u5en\AppData\Local\Continuum\Anaconda3\lib\site-packages\numpy\lib\function_base.py", line 1769, in _vectorize_call
outputs = ufunc(*inputs)
File "C:\Users\u5en\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 4358, in <lambda>
f = np.vectorize(lambda x: x.decode(encoding), otypes=[np.object])
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 7: unexpected end of data
How can I store and query my unicode data in hdf5?
I have many tables with column names I do not know beforehand and which are not proper pytable
names (NaturalNameWarning
). I would like the user to be able to query on this columns, so I wonder how could I query these when their name prevents me? I see this used to have no easy fix, so if that is still the case I will just remove the offending characters from the heading.
import csv
import pandas as pd
dados = pd.read_csv("EKKA - Cópia.csv")
print(dados)
store= pd.HDFStore('teste.h5' , encoding="utf-8")
store.append("EKKA", dados, format="table", data_columns=True)
store.select("EKKA", "columns=['Mdt', 'Fornecedor']")
store.close()
Would I be better off doing this in sqlite
?
Environment:
Upvotes: 1
Views: 1027
Reputation: 129018
So under Python 2.7 on Windows 7, pandas 0.15.2, everything worked as expected, no encoding necessary. However on Python 3.4, the following worked for me. Apparently some characters are not representable in 'utf-8'; 'latin1' encoding usually solves these issues. Note that I had to read the csv in the first place with this encoding.
>>> df = pd.read_csv('../../test.csv',encoding='latin1')
>>> df
Mdt Doccompra OrgC Cen NumP Criadopor Dtcriacao Fornecedor P Fun
0 400 8751215432 2581 NaN 1 MIGRAÇ\xc3O 01.10.2004 75852214 NaN TD
1 400 5464282154 9874 NaN 1 MIGRAÇ\xc3O 01.10.2004 78995411 NaN FO
Further, the encoding must be specified not when opening the store, but on the append/put
calls
>>> df.to_hdf('test.h5','df',format='table',mode='w',data_columns=True,encoding='latin1')
>>> pd.read_hdf('test.h5','df')
Mdt Doccompra OrgC Cen NumP Criadopor Dtcriacao Fornecedor P Fun
0 400 8751215432 2581 NaN 1 MIGRAÇ\xc3O 01.10.2004 75852214 NaN TD
1 400 5464282154 9874 NaN 1 MIGRAÇ\xc3O 01.10.2004 78995411 NaN FO
Once it is written encoded, it is not necessary to specify the encoding when reading.
Upvotes: 1