user305883

Reputation: 1741

What is the best practice for skipping non-ASCII characters in mixed-encoding text in Python 3?

I was able to import a text file into an Elasticsearch index on my local machine.

Despite using a virtual environment, the production machine is a nightmare, because I keep getting errors like:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 79: ordinal not in range(128)

I am using Python 3 and personally had fewer issues with Python 2; maybe that is just the frustration of a couple of wasted hours.

I can't understand why I am not able to strip or handle non-ASCII characters.

I tried to import:

from unidecode import unidecode
def remove_non_ascii(text):
    return unidecode(unicode(text, encoding = "utf-8"))

using python2, no success.

back on python3:

import string
printable = set(string.printable)

''.join(filter(lambda x: x in printable, 'mixed non ascii string'))

no success

import codecs
with codecs.open(path, encoding='utf8') as f:
 ....

no success

tried:

# -*- coding: utf-8 -*- 

no success

https://docs.python.org/2/library/unicodedata.html#unicodedata.normalize

no success ...

None of the above seems able to strip or handle the non-ASCII characters. It is very cumbersome, and I keep getting the following error:

with open(path) as f:
    for line in f:
        line = line.replace('\n','')
        el = line.split('\t')
        print (el)
        _id = el[0]
        _source = el[1]
        _name = el[2]
        # _description = ''.join( filter(lambda x: x in printable, el[-1]) )
        # 
        _description = remove_non_ascii( el[-1] )
        print (_id, _source,  _name, _description, setTipe( _source ) )
        action = {
            "_index": _indexName,
            "_type": setTipe( _source ),
            "_id": _source,
            "_source": {
                "name": _name,
                "description" : _description
                }
            }
        helpers.bulk(es, [action])

  File "<stdin>", line 22, in <module>
  File "/usr/local/lib/python2.7/dist-packages/elasticsearch/helpers/__init__.py", line 194, in bulk
    for ok, item in streaming_bulk(client, actions, **kwargs):
  File "/usr/local/lib/python2.7/dist-packages/elasticsearch/helpers/__init__.py", line 162, in streaming_bulk
    for result in _process_bulk_chunk(client, bulk_actions, raise_on_exception, raise_on_error, **kwargs):
  File "/usr/local/lib/python2.7/dist-packages/elasticsearch/helpers/__init__.py", line 87, in _process_bulk_chunk
    resp = client.bulk('\n'.join(bulk_actions) + '\n', **kwargs)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 79: ordinal not in range(128)

I would like a definitive practice for handling encoding problems in Python 3 - I am using the same scripts on different machines and getting different results...

Upvotes: 0

Views: 604

Answers (1)

apoorlydrawnape

Reputation: 288

ASCII characters have code points 0-127 (values 128-255 are not ASCII).

def remove_non_ascii(text):
    # Keep only characters whose code point is in the ASCII range (0-127).
    return "".join(character for character in text if ord(character) < 128)
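
As a complement, here is a minimal Python 3 sketch of the same idea using the built-in codecs instead of a manual loop. It assumes the file is mostly UTF-8 and that dropping every non-ASCII character is acceptable. The names `strip_non_ascii` and `read_lines_as_ascii` are hypothetical helpers, not part of any library. Passing `encoding` explicitly to `open` matters because otherwise Python 3 may fall back to the machine's locale encoding, which could explain why the same script behaves differently on different machines.

```python
def strip_non_ascii(text):
    """Drop every character outside the ASCII range (code points 0-127)."""
    # Encoding to ASCII with errors="ignore" silently discards anything
    # that cannot be represented; decoding gives back a str.
    return text.encode("ascii", errors="ignore").decode("ascii")


def read_lines_as_ascii(path, encoding="utf-8"):
    """Yield each line of the file at `path` with non-ASCII characters removed."""
    # errors="replace" turns undecodable bytes into U+FFFD instead of raising,
    # and strip_non_ascii then removes that replacement character as well.
    with open(path, encoding=encoding, errors="replace") as f:
        for line in f:
            yield strip_non_ascii(line.rstrip("\n"))
```

In the loop from the question, this would mean iterating over `read_lines_as_ascii(path)` instead of the bare file object, so every field is already plain ASCII before it is passed to `helpers.bulk`.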

Upvotes: 1
