xml.etree.ElementTree.ParseError: not well-formed (invalid token) due to "<" symbol in script

Question

I'm trying to parse a web page in order to save some of its data to an Excel or CSV file.

import urllib.request
import xml.etree.ElementTree as ET

url = "http://rusdrama.com/afisha"
response = urllib.request.urlopen(url)
content = response.read()
root = ET.fromstring(content)

When I parse the page with ElementTree's fromstring method, I get the following error:

Traceback (most recent call last):
  File "D:/PythonProjects/PythonMisc/theater_reader.py", line 7, in <module>
    root = ET.fromstring(content)
  File "D:\Python\Python35\lib\xml\etree\ElementTree.py", line 1333, in XML
    parser.feed(text)
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 49, column 14

The failing part of the received page is inside a script; specifically, line 49 is:

    if (scroll <= 100) {

So the problem is the opening angle bracket, which seems to be treated as the start of a tag. I have seen several similar questions, but I can't work out how to handle this situation.

alecxe · Accepted Answer

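You are trying to parse HTML with an XML parser. HTML is generally not well-formed XML: a bare "<" inside a script is legal in HTML but is an invalid token in XML. Use a proper tool, an HTML parser, instead; BeautifulSoup and lxml.html are the most popular.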
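
To illustrate the cause, here is a minimal sketch using two invented snippets (not the real page) that differ only in how the "<" inside the script is written. The helper name try_xml is hypothetical. The raw form should fail in ElementTree with the same kind of "invalid token" error as in the question, while the escaped form parses:

```python
import xml.etree.ElementTree as ET

# Invented snippets, not the real page: identical except for how "<" is written.
RAW = "<html><script>if (scroll <= 100) { hide(); }</script></html>"
ESCAPED = "<html><script>if (scroll &lt;= 100) { hide(); }</script></html>"


def try_xml(text):
    """Return (parsed_ok, detail): script text on success, error message on failure."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError as exc:
        return False, str(exc)
    return True, root.find("script").text


print(try_xml(RAW))      # fails: the bare "<" is read as the start of a tag
print(try_xml(ESCAPED))  # succeeds: "&lt;" is decoded back to "<" in the text
```

Real pages rarely escape their scripts this way, which is why switching to an HTML parser is the practical fix rather than trying to repair the markup.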

Demo:

>>> from bs4 import BeautifulSoup
>>> import urllib.request
>>> 
>>> url = "http://rusdrama.com/afisha"
>>> response = urllib.request.urlopen(url)
>>>
>>> soup = BeautifulSoup(response, "html.parser")
>>> print(soup.title.get_text())
Афиша Харьковского академического русского драматического театра Пушкина
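
Since the goal in the question is to save data to CSV, here is a rough sketch of that step. It is an assumption-heavy illustration: it uses only the standard library's html.parser module (the backend named in the BeautifulSoup call above) instead of BeautifulSoup itself, and the markup, the "show" class, the link targets, and the names ShowLinkParser and rows_to_csv are all invented, because the real page structure is not shown here.

```python
import csv
import io
from html.parser import HTMLParser

# Invented sample markup; the real page's structure is not known here.
SAMPLE = """
<html><head><script>if (scroll <= 100) { hide(); }</script></head>
<body>
  <a class="show" href="/show/1">First play</a>
  <a class="show" href="/show/2">Second play</a>
  <a href="/about">About</a>
</body></html>
"""


class ShowLinkParser(HTMLParser):
    """Collects (title, href) for every <a class="show"> element."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._href = None  # href of the link currently being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("class") == "show":
            self._href = attrs.get("href", "")
            self._text = []

    def handle_data(self, data):
        # Script content also arrives here, but is ignored outside a link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.rows.append(("".join(self._text).strip(), self._href))
            self._href = None


def rows_to_csv(rows):
    """Serialize the rows to CSV text with a header line."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    writer.writerow(["title", "href"])
    writer.writerows(rows)
    return buffer.getvalue()


parser = ShowLinkParser()
parser.feed(SAMPLE)
parser.close()
print(rows_to_csv(parser.rows))
```

To write a file instead of a string, the same csv.writer calls can be pointed at open("shows.csv", "w", newline="", encoding="utf-8"); the file name is only an example.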
