user3430235

Reputation: 439

UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte

I know there are a lot of questions about encoding and decoding, but I can't figure this one out:

def content(title, sents):
    sent_elems = []
    for sent_i, sent in enumerate(sents, 1):
        elem = u"<a name=\"{i}\">[{i}]</a> <a href=\"#{i}\" id={i}>{text}</a>".format(i=sent_i, text=sent.text)
        sent_elems.append(elem)
    doc = u"""<html>
<head>
<title>{title}</title>
</head>
<body>{elems}</body>
</html>""".format(title=title, elems="\n".join(sent_elems))

    return doc

Calling the content function gives me this error in very rare cases (maybe once or twice in my whole dataset):

  File "processing.py", line 68, in score_summary
    self._write_config(references, summary)
  File "processing.py", line 56, in _write_config
    reference_files = self._write_references(references, reference_dir)
  File "processing.py", line 44, in _write_references
    f.write(rouge_summary_content(reference.id, reference.sents))
  File "processing.py", line 154, in rouge_summary_content
    </html>""".format(title=title, elems="\n".join(sent_elems))
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte

I've changed it to:

sent_elems.append(elem.decode("utf-8", "ignore"))

and also

sent_elems.append(elem.decode("utf-8", "replace"))

Still the same error.

I looked at the data and couldn't figure out why this happens. I checked the file where the error occurs and still found no non-UTF-8 characters.

I also added this in my file:

import sys
reload(sys)
sys.setdefaultencoding("utf-8") 

The problem is still there. Any suggestions?

Upvotes: 2

Views: 7352

Answers (2)

vijay

Reputation: 886

If your data looks like this:

data="0\x80\x06\t*\x86H\x86\xf7\r\x01\x07\x04\xa0\x800\x80\x02\x01\x01\x0e0\x0c\x06\b*\x86H\x86\xf7\r\x02\x05\x05....."

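The raw bytes are not valid UTF-8, but Base64-encoding them first gives a plain ASCII string that decodes as UTF-8 without errors (Python 2):

import base64
import urllib

encoded = base64.b64encode(data)
decoded = urllib.unquote(encoded).decode('utf8')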

The result would be like this:

MIAGCSqGSIb3DQEHAq...
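
For readers on Python 3, here is a minimal sketch of the same idea. It uses only the first 12 bytes of the sample data shown above (the rest is elided in the post), and the variable name `decodes_as_utf8` is just for illustration. Note that Base64 does not recover any text from the bytes; it only re-expresses them in ASCII, so the result has to be Base64-decoded again to get the original bytes back.

```python
import base64

# First 12 bytes of the sample data from the answer; the remainder is elided there.
data = b"0\x80\x06\t*\x86H\x86\xf7\r\x01\x07"

# Decoding the raw bytes as UTF-8 fails, because 0x80 cannot start a character.
try:
    data.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False

# Base64 maps arbitrary bytes onto ASCII characters, which are always valid UTF-8.
encoded = base64.b64encode(data).decode("ascii")

print(decodes_as_utf8)  # False
print(encoded)          # MIAGCSqGSIb3DQEH
```

`base64.b64decode(encoded)` returns the original bytes unchanged.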

Upvotes: 0

user3430235

Reputation: 439

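My titles were generated with chr(65 + index), so once the index ran past the capital letters it eventually produced bytes that are not valid UTF-8 (in Python 2, chr(128) is '\x80', the byte named in the error). I changed it to str(index) and that solved my original problem.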
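
A rough way to see where chr(65 + index) goes wrong, written as a Python 3 sketch. The helpers `py2_chr`, `first_bad_index`, and `safe_title` are hypothetical names used only for this illustration; `py2_chr` imitates Python 2's `chr`, which returns a single byte rather than a Unicode character.

```python
def py2_chr(code):
    """Imitate Python 2's chr(): a one-byte string for 0 <= code <= 255."""
    return bytes([code])


def first_bad_index(base=65):
    """Return the first index whose title byte is not valid UTF-8 by itself."""
    for index in range(256 - base):
        try:
            py2_chr(base + index).decode("utf-8")
        except UnicodeDecodeError:
            return index
    return None


def safe_title(index):
    """Title scheme from the fix: decimal digits are always plain ASCII."""
    return str(index)


print(first_bad_index())  # 63, because 65 + 63 == 128 == 0x80
print(safe_title(63))     # 63
```

With 65 as the base, the first 63 titles (indexes 0 to 62) are ASCII and decode fine, which would explain why the error showed up only for the rare inputs with that many titles.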

Upvotes: 1
