BeautifulSoup Parses Table Incorrectly

Question

I'm having trouble getting Beautiful Soup to properly process a large table of play-by-play basketball data. Code:

import urllib.request
from bs4 import BeautifulSoup

request = urllib.request.Request('http://www.basketball-reference.com/boxscores/pbp/201611220LAL.html')
result = urllib.request.urlopen(request)
resulttext = result.read()
soup = BeautifulSoup(resulttext, "html.parser")

pbpTable = soup.find('table', id="pbp")

If you run this example yourself, you will find that the table is not fully parsed; all we get out is this:


Play-By-Play Table

1st Q

The problem is in the parsing itself: printing the soup variable gives (among other things):





Play-By-Play
Jump to: 1st | 2nd | 3rd | 4th
scoring play tie lead change

Play-By-Play Table
1st Q

Most importantly, a closing </table> tag appears out of nowhere. Viewing the page source of the relevant link, we can see that the table is not closed there; it goes on for a while. Is there any fix for this besides implementing my own HTML parsing code?

furas · Accepted Answer

Use "lxml" or "html5lib" instead of "html.parser" when creating the soup:

soup = BeautifulSoup(resulttext, "lxml")

Both parsers are more lenient with malformed HTML, and you get more data.
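To check which parser recovers the most of the table, one option is to count the rows each installed parser finds. This is only a minimal sketch: `count_rows` is a hypothetical helper, and the sample markup is made up for illustration rather than taken from the real page.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def count_rows(markup, table_id="pbp",
               parsers=("html.parser", "lxml", "html5lib")):
    """Map each installed parser name to the number of <tr> tags it
    finds inside the table with the given id (0 if no table is found)."""
    counts = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(markup, name)
        except FeatureNotFound:
            continue  # this parser is not installed, skip it
        table = soup.find("table", id=table_id)
        counts[name] = len(table.find_all("tr")) if table else 0
    return counts


# Made-up, well-formed sample; swap in resulttext from the question
# to compare the parsers on the real page.
sample = ("<table id='pbp'><tr><td>1st Q</td></tr>"
          "<tr><td>Jump ball</td></tr></table>")
print(count_rows(sample))
```

On the real page, a parser that stops early should report far fewer rows than the others.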

You may have to install lxml or html5lib first if you don't have them yet:

pip install lxml

pip install html5lib

If no prebuilt package is available for your platform, lxml may need a C/C++ compiler, the libxml library (libxml.dll on Windows), etc.
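If you can't be sure lxml will install everywhere the script runs, one possible approach is to try the parsers in order of preference and fall back to the built-in one. This is a sketch, not part of the Beautiful Soup API: `make_soup` and its `preferred` parameter are hypothetical names.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Return (soup, parser_name) using the first parser that is installed."""
    for name in preferred:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            continue  # parser missing, try the next one
    raise RuntimeError("none of the requested parsers is installed")


# Made-up sample markup; in the question this would be resulttext.
soup, used = make_soup("<table id='pbp'><tr><td>1st Q</td></tr></table>")
print("parsed with", used)
```

Because "html.parser" ships with Python, the fallback always succeeds, but it may still truncate the table as described in the question, so it is worth logging which parser was actually used.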
