Reputation: 1741
I was able to import a text file on an elasticsearch index in mylocal machine.
Despite using virtual environment, on the production machine is a nightmare, because I keep having errors like:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 79: ordinal not in range(128)
I am using python3 and I personally was having less issues in python2, maybe it is just frustration of wasted couple of hours.
I can't understand why, I am not able to strip or handle non ascii chars:
I tried to import:
from unidecode import unidecode
def remove_non_ascii(text):
return unidecode(unicode(text, encoding = "utf-8"))
using python2, no success.
back on python3:
import string
printable = set(string.printable)
''.join( filter(lambda x: x in printable, 'mixed non ascii string' )
no success
import codecs
with codecs.open(path, encoding='utf8') as f:
....
no success
tried:
# -*- coding: utf-8 -*-
no success
https://docs.python.org/2/library/unicodedata.html#unicodedata.normalize
no success ...
All of the above seems can't strip or handle the non ascii, it is very cumbersome, I keep on having following errors:
with open(path) as f:
for line in f:
line = line.replace('\n','')
el = line.split('\t')
print (el)
_id = el[0]
_source = el[1]
_name = el[2]
# _description = ''.join( filter(lambda x: x in printable, el[-1]) )
#
_description = remove_non_ascii( el[-1] )
print (_id, _source, _name, _description, setTipe( _source ) )
action = {
"_index": _indexName,
"_type": setTipe( _source ),
"_id": _source,
"_source": {
"name": _name,
"description" : _description
}
}
helpers.bulk(es, [action])
File "<stdin>", line 22, in <module>
File "/usr/local/lib/python2.7/dist-packages/elasticsearch/helpers/__init__.py", line 194, in bulk
for ok, item in streaming_bulk(client, actions, **kwargs):
File "/usr/local/lib/python2.7/dist-packages/elasticsearch/helpers/__init__.py", line 162, in streaming_bulk
for result in _process_bulk_chunk(client, bulk_actions, raise_on_exception, raise_on_error, **kwargs):
File "/usr/local/lib/python2.7/dist-packages/elasticsearch/helpers/__init__.py", line 87, in _process_bulk_chunk
resp = client.bulk('\n'.join(bulk_actions) + '\n', **kwargs)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 79: ordinal not in range(128)
I would like to have a "definite" practice to handle encoding problems in python3 - I am using same scripts on different machines, and having different results...
Upvotes: 0
Views: 604
Reputation: 288
ASCII characters are 0-255.
def remove_non_ascii(text):
ascii_characters = ""
for character in text:
if ord(character) <= 255:
ascii_characters += character
return ascii_characters
Upvotes: 1