swap310

Reputation: 768

UnicodeDecodeError in Python 2.7

I am trying to read a UTF-8 encoded XML file in Python, and I do some processing on the lines read from the file, something like this:

next_sent_separator_index =  doc_content.find(word_value, int(characterOffsetEnd_value) + 1)

Here doc_content is the line read from the file and word_value is one of the strings from the same line. I get an encoding-related error on the above line whenever doc_content or word_value contains Unicode characters. So I tried decoding them as UTF-8 first (instead of relying on the default ASCII codec), as below:

next_sent_separator_index =  doc_content.decode('utf-8').find(word_value.decode('utf-8'), int(characterOffsetEnd_value) + 1)

But I am still getting an error (a UnicodeEncodeError, according to the traceback):

Traceback (most recent call last):
  File "snippetRetriver.py", line 402, in <module>
    sentences_list,lemmatised_sentences_list = getSentenceList(form_doc)
  File "snippetRetriver.py", line 201, in getSentenceList
    next_sent_separator_index =  doc_content.decode('utf-8').find(word_value.decode('utf-8'), int(characterOffsetEnd_value) + 1)
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 8: ordinal not in range(128)

Can anyone suggest a suitable approach to avoid these kinds of encoding errors in Python 2.7?

Upvotes: 3

Views: 6579

Answers (1)

dda

Reputation: 6213

codecs.utf_8_decode(input.encode('utf8'))
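Some background on why this can happen: the traceback shows a UnicodeEncodeError raised from inside .decode(). In Python 2 that usually means the value was already a unicode object, so Python first tried to encode it implicitly with the ASCII codec before decoding it. One way around it is to decode only values that are still byte strings. The sketch below is just an illustration: to_text is a hypothetical helper (not part of the question's code or of any library), and the sample strings are made up.

```python
from __future__ import unicode_literals  # no effect on Python 3, makes literals unicode on Python 2


def to_text(value, encoding="utf-8"):
    """Return value as text, decoding only if it is still a byte string.

    On Python 2, `bytes` is an alias of `str`, so unicode objects pass
    through untouched and never hit the implicit ASCII encode.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Made-up sample data: one value is raw UTF-8 bytes, the other is already text.
doc_content = "caf\u00e9 au lait, caf\u00e9 noir".encode("utf-8")
word_value = "caf\u00e9"

# Both operands are normalised to text before searching, as in the question.
index = to_text(doc_content).find(to_text(word_value), 1)
```

Note that the resulting index counts characters in the decoded text, not bytes in the original line, so any offsets taken from the file need to be in the same unit.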

Upvotes: 5
