newBike

Reputation: 15022

UnicodeDecodeError exception from reading xml file

I tried to parse an XML file with BeautifulSoup:

    content = open(filename, encoding='utf-8').read()
    return BeautifulSoup(content)

I then checked the source file's encoding with chardetect, which reported ASCII:

    $ chardetect ../complete_data/sample.xml
    ../complete_data/sample.xml: ascii with confidence 1.0

However, my program still fails with the exception shown at the end of this post.

How can I fix this? And how can I determine the correct encoding in the future? The exception message from Python is not much help.

Exception

Traceback (most recent call last):
  File "parser_factory.py", line 97, in <module>
    test_shareholder_meetings()
  File "parser_factory.py", line 81, in test_shareholder_meetings
    _import_source_files(collection_name="shareholder_meetings", dataset_name="WSH_BoD_Shareholder")
  File "parser_factory.py", line 78, in _import_source_files
    parser(f, collection_name).import_data()
  File "/workspace/balala-wsh/worker/parser_base.py", line 21, in __init__
    self.soup = self.read_file_in_bs(filename)
  File "/workspace/balala-wsh/worker/parser_base.py", line 30, in read_file_in_bs
    content = open(filename, encoding='utf-8').read()
  File "/Users/sample_user/.pyenv/versions/3.4.3/lib/python3.4/codecs.py", line 319, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe7 in position 180145: invalid continuation byte

Upvotes: 1

Views: 490

Answers (2)

tripleee

Reputation: 189809

chardet does not examine the entire file. If it contains a lone 0xE7, it's certainly not ASCII, and apparently not UTF-8, either.

Perhaps https://tripleee.github.io/8bit#e7 can help you determine what it really is.
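To see what the byte actually belongs to, it can help to read the file in binary mode and look at the bytes around the failure. The following is only a minimal sketch: the helper names `first_undecodable` and `show_bytes_around` are made up for illustration, and the file path is the one from the question.

```python
def first_undecodable(data, encoding="utf-8"):
    """Return the offset of the first byte that fails to decode, or None."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        # exc.start is the index of the first offending byte in `data`
        return exc.start
    return None


def show_bytes_around(data, position, context=20):
    """Return the raw bytes surrounding `position` for manual inspection."""
    start = max(0, position - context)
    return data[start:position + context]


# Hypothetical usage with the file from the question:
# with open("../complete_data/sample.xml", "rb") as f:
#     raw = f.read()
# offset = first_undecodable(raw)
# if offset is not None:
#     print(offset, show_bytes_around(raw, offset))
```

The surrounding bytes usually make the intended word recognizable, which narrows down the candidate encodings considerably.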

Upvotes: 1

galaxyan

Reputation: 6141

You can try 'cp1252' to decode the text.
I believe the text you are reading is not encoded as UTF-8.
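As a rough sketch of that idea, assuming the file is either UTF-8 or cp1252 (which is a guess, not something the traceback proves), you could try the encodings in order. The function name `decode_with_fallback` and the candidate list are illustrative only.

```python
def decode_with_fallback(raw, encodings=("utf-8", "cp1252")):
    """Try each candidate encoding in order; return (text, encoding_used)."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            # This candidate cannot represent the bytes; try the next one
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# Hypothetical usage with the file from the question:
# with open("../complete_data/sample.xml", "rb") as f:
#     text, used = decode_with_fallback(f.read())
```

Note that a successful decode does not prove the guess is right: many 8-bit encodings will accept the same bytes and simply produce different characters, so it is worth checking the decoded text by eye around the position from the error message.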

Upvotes: 1
