Reputation: 129
I am trying to extract TR data from the following page: http://www.datasheetcatalog.com/catalog/p1342320.shtml
I am using requests and BeautifulSoup. However, I don't get all the rows (only 12 instead of 22 from the second table). Does anybody have an explanation for this, given that the rows are there when printing response.content?
Here is the code I am using:
from bs4 import BeautifulSoup
import requests
session = requests.Session()
url = 'http://www.datasheetcatalog.com/catalog/p1342320.shtml'
response = session.get(url)
soup = BeautifulSoup(response.content, "lxml")
trs = soup.find_all('table')[8].find_all('tr')
print(len(trs))
Upvotes: 1
Views: 81
Reputation: 129
After a detailed examination of the HTML page, I found that BeautifulSoup stopped parsing after hitting comments. So the solution is to change the parser from "lxml" to "html5lib":
soup = BeautifulSoup(response.content,"html5lib")
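One way to confirm that rows are being lost by the parser rather than missing from the response is to count the opening <tr> tags in the raw markup and compare that number with what the parsed tree reports. The following is only a minimal sketch using the standard library: the helper name count_raw_rows and the sample string are made up for illustration and are not taken from the real page.

```python
import re

def count_raw_rows(html: str) -> int:
    """Count opening <tr> tags in the raw markup, without using any HTML parser."""
    # \b keeps the pattern from matching other tags that merely start with "tr".
    return len(re.findall(r'<tr\b', html, flags=re.IGNORECASE))

# Hypothetical snippet with a malformed comment between two rows.
sample = '<table><tr><td>a</td></tr><!-- 1 --!><TR><td>b</td></TR></table>'
print(count_raw_rows(sample))  # prints 2
```

On the real page, calling count_raw_rows(response.text) and comparing the result with len(soup.find_all('tr')) for each parser should show which parser keeps every row. Note that this counts rows across the whole document, not just one table.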
Upvotes: 1
Reputation: 19154
The HTML is not valid, which breaks BeautifulSoup's parsing.
Here is a fix:
....
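import re

# Repair the broken '<table <' tag fragment.
html_doc = response.text.replace('<table <', '<')
# Remove malformed comments of the form '<!-- 12 --!>'.
html_doc = re.sub(r'<!--\s+\d+\s+--!>', '', html_doc)
# Strip opening and closing <font> tags.
html_doc = re.sub(r'</?font.*?>', '', html_doc)
soup = BeautifulSoup(html_doc, "html.parser")
trs = soup.find_all('table')[8].find_all('tr')
print(len(trs))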
Note: using "lxml" as the parser returns 7, not 22.
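To see what the three cleanup steps do in isolation, they can be wrapped in a small function and run on a short string. This is just a sketch: the function name clean_html and the sample row are hypothetical, and the real page may contain other kinds of invalid markup that these patterns do not cover.

```python
import re

def clean_html(raw: str) -> str:
    """Apply the three cleanup steps from this answer to a raw HTML string."""
    # Repair the broken '<table <' tag fragment.
    cleaned = raw.replace('<table <', '<')
    # Drop malformed numbered comments such as '<!-- 12 --!>'.
    cleaned = re.sub(r'<!--\s+\d+\s+--!>', '', cleaned)
    # Strip opening and closing <font> tags, keeping the text between them.
    cleaned = re.sub(r'</?font.*?>', '', cleaned)
    return cleaned

# Hypothetical row, not copied from the real page.
sample = '<tr><!-- 3 --!><td><font color="red">BC547</font></td></tr>'
print(clean_html(sample))  # prints <tr><td>BC547</td></tr>
```

The cleaned string can then be passed to BeautifulSoup exactly as html_doc is above.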
Upvotes: 0