Problem in parsing the XML file: xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 19, column 175

Question

I need to parse an XML file with multiple roots, but I am unable to read the file. I get this error:

Traceback (most recent call last):
  File "C:/Users/Abhi/PycharmProjects/Trec_project/Index_with_Xml.py", line 34, in <module>
    root = ET.fromstringlist(complete)
  File "C:\Users\Abhi\Anaconda3\lib\xml\etree\ElementTree.py", line 1355, in fromstringlist
    parser.feed(text)
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 19, column 175

The same code works for other XML files. I changed the encoding, but that didn't help. I need to parse each element of the file. The data has multiple roots.
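
Since the message points at a specific position (line 19, column 175), one way to narrow this down is to print the text around that position. Below is a minimal sketch, not a definitive fix: `show_parse_error` and the sample string are hypothetical names made up for illustration, and it assumes the line number is 1-based and the column 0-based. For non-ASCII text the column may not map exactly onto a character index, so the function returns a window of text rather than a single character.

```python
import xml.etree.ElementTree as ET


def show_parse_error(xml_text):
    """Return None if xml_text parses, else (line, column, snippet) for the error."""
    try:
        ET.fromstring(xml_text)
        return None
    except ET.ParseError as err:
        # Assumption: line is 1-based and column is 0-based.
        line, column = err.position
        # Split on "\n" only; str.splitlines() also breaks on characters such
        # as form feed (\x0c), which would shift the line count.
        lines = xml_text.split("\n")
        bad_line = lines[line - 1] if 0 < line <= len(lines) else ""
        # Return a window around the reported column, not a single character.
        snippet = bad_line[max(0, column - 20):column + 20]
        return line, column, snippet


# Hypothetical input: a bare "&" is not allowed in XML character data.
sample = "<root>\n<doc>Boykin & Coale</doc>\n</root>"
print(show_parse_error(sample))
```

Running this on the real file contents (after wrapping them in a single root) should show which character near column 175 the parser rejects, for example an unescaped `&` or a control character.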

Here is the sample data:


WAPO_b2e89334-33f9-11e1-825f-dabc29fd7071-1

https://www.washingtonpost.com/sports/colleges/danny-coale-jarrett-boykin-are-a-perfect-1-2-punch-for-virginia-tech/2011/12/31/gIQAAaW4SP_story.html



NEW ORLEANS — Whenever a Virginia Tech offensive coach is asked how the most prolific receiving duo in school history came to be, inevitably the first road game in 2008 against North Carolina comes up.




WAPO_b2e89334-33f9-11e1-825f-dabc29fd7071-2

https://www.washingtonpost.com/sports/colleges/danny-coale-jarrett-boykin-are-a-perfect-1-2-punch-for-virginia-tech/2011/12/31/gIQAAaW4SP_story.html



Midway through the first quarter, Virginia Tech had to call two timeouts in a row because then-freshmen Jarrett Boykin and Danny Coale couldn’t seem to line up right, and “they had those big eyes out there looking around,” Kevin Sherman, their position coach, said recently.




WAPO_b2e89334-33f9-11e1-825f-dabc29fd7071-3

https://www.washingtonpost.com/sports/colleges/danny-coale-jarrett-boykin-are-a-perfect-1-2-punch-for-virginia-tech/2011/12/31/gIQAAaW4SP_story.html



Now that Boykin and Coale have only Tuesday’s Sugar Bowl remaining before leaving Virginia Tech with every major school record for a wide receiver, they’ve taken a different stance.




WAPO_b2e89334-33f9-11e1-825f-dabc29fd7071-4

https://www.washingtonpost.com/sports/colleges/danny-coale-jarrett-boykin-are-a-perfect-1-2-punch-for-virginia-tech/2011/12/31/gIQAAaW4SP_story.html



“I still don’t think that was on us. Macho [Harris] was in the game and he lined up wrong,” said Boykin, as Coale sat next to him nodding in agreement.

The code

import xml.etree.ElementTree as ET
# import xml.etree.cElementTree as ET

with open(path, encoding='utf-8-sig',errors='ignore') as f:
    #it = itertools.chain('', f, '')
    data=f.read()
    complete="" + data + ""
    #fixed = it.replace(b'\x0c', b'')
    root = ET.fromstringlist(complete)

# Do something with `root`
for x in root:
    print(x[0].text)
    print(x[2].text)
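
For reference, a minimal sketch of the wrap-in-one-root approach the code above is attempting. The tag names (`DOC`, `DOCNO`, `URL`, `TEXT`), the wrapper name `root`, and the helper `parse_fragments` are all assumptions for illustration, since the markup of the real sample is not shown. It uses `ET.fromstring`, which takes a single string, whereas `ET.fromstringlist` expects a sequence of string chunks.

```python
import xml.etree.ElementTree as ET


def parse_fragments(text, wrapper="root"):
    """Parse text holding several top-level elements by wrapping it in one root."""
    # The wrapper element name is arbitrary; it only has to enclose the fragments.
    return ET.fromstring("<{0}>{1}</{0}>".format(wrapper, text))


# Hypothetical tag names and values, made up for this example.
fragments = (
    "<DOC><DOCNO>id-1</DOCNO><URL>http://example.com/a</URL><TEXT>first</TEXT></DOC>\n"
    "<DOC><DOCNO>id-2</DOCNO><URL>http://example.com/b</URL><TEXT>second</TEXT></DOC>\n"
)
root = parse_fragments(fragments)
# Look children up by tag name instead of by position (x[0], x[2]).
docs = [(doc.findtext("DOCNO"), doc.findtext("TEXT")) for doc in root]
print(docs)
```

Wrapping only makes the document single-rooted. It will still fail if the text contains an XML declaration or a character the parser rejects, which is what "invalid token" usually indicates.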
