Chae

Reputation: 193

Python - BeautifulSoup error while scraping

UPDATE: Using lxml instead of html.parser solved the problem, as Freddy suggested in the answer below!

I am trying to scrape some information from this website: https://www.ticketmonster.co.kr/deal/952393926.

I get an error when I run soup(thispage, 'html.parser'), but the error only happens for this specific page. Does anyone know why this is happening?

The code I have so far is very simple:

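from urllib.request import urlopen
from bs4 import BeautifulSoup as soup

url = 'https://www.ticketmonster.co.kr/deal/952393926'

# Download the raw page bytes
openU = urlopen(url)
thispage = openU.read()
openU.close()  # close the HTTP response

# Parse with Python's built-in parser (this is the call that fails)
pageS = soup(thispage, 'html.parser')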

The error I get is:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\Kathy\AppData\Local\Programs\Python\Python36\lib\site-packages\bs4\__init__.py", line 228, in __init__
    self._feed()
  File "C:\Users\Kathy\AppData\Local\Programs\Python\Python36\lib\site-packages\bs4\__init__.py", line 289, in _feed
    self.builder.feed(self.markup)
  File "C:\Users\Kathy\AppData\Local\Programs\Python\Python36\lib\site-packages\bs4\builder\_htmlparser.py", line 215, in feed
    parser.feed(markup)
  File "C:\Users\Kathy\AppData\Local\Programs\Python\Python36\lib\html\parser.py", line 111, in feed
    self.goahead(0)
  File "C:\Users\Kathy\AppData\Local\Programs\Python\Python36\lib\html\parser.py", line 179, in goahead
    k = self.parse_html_declaration(i)
  File "C:\Users\Kathy\AppData\Local\Programs\Python\Python36\lib\html\parser.py", line 264, in parse_html_declaration
    return self.parse_marked_section(i)
  File "C:\Users\Kathy\AppData\Local\Programs\Python\Python36\lib\_markupbase.py", line 149, in parse_marked_section
    sectName, j = self._scan_name( i+3, i )
  File "C:\Users\Kathy\AppData\Local\Programs\Python\Python36\lib\_markupbase.py", line 391, in _scan_name
    % rawdata[declstartpos:declstartpos+20])
  File "C:\Users\Kathy\AppData\Local\Programs\Python\Python36\lib\_markupbase.py", line 34, in error
    "subclasses of ParserBase must override error()")
NotImplementedError: subclasses of ParserBase must override error()

Please help!

Upvotes: 1

Views: 1638

Answers (1)

Freddy

Reputation: 879

Try using

pageS = soup(thispage, 'lxml')

instead of

pageS = soup(thispage, 'html.parser')

It looks like it may be a problem with character encoding when using "html.parser".
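If you want the script to keep working on pages where one parser fails, a small fallback helper is one option. This is only a sketch: parse_with_fallback and its factory parameter are names made up for illustration, not part of BeautifulSoup, and it assumes that any exception from one parser is a good enough reason to try the next one.

```python
def parse_with_fallback(markup, parsers=("html.parser", "lxml"), factory=None):
    """Return the first successful parse of `markup`, trying `parsers` in order.

    `factory` is a hypothetical hook: any callable taking (markup, parser_name).
    It defaults to bs4.BeautifulSoup.
    """
    if factory is None:
        # Imported lazily so bs4 is only required when the default is used.
        from bs4 import BeautifulSoup
        factory = BeautifulSoup

    last_error = None
    for name in parsers:
        try:
            return factory(markup, name)
        except Exception as exc:  # e.g. the NotImplementedError in the traceback
            last_error = exc      # remember why this parser failed, then move on

    # Every parser failed: surface the last underlying error as the cause.
    raise RuntimeError(f"no parser in {parsers!r} could handle the markup") from last_error
```

With the question's variables this would be called as pageS = parse_with_fallback(thispage). Note that lxml is a separate package (pip install lxml), so the fallback only helps if it is installed.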

Upvotes: 2
