Lyesgigs

Reputation: 129

Python requests is not extracting all elements

I am trying to extract TR data from the following page: http://www.datasheetcatalog.com/catalog/p1342320.shtml

I am using requests and BeautifulSoup. However, I don't get all rows (only 12 instead of 22 from the second table). Does anybody have an explanation for this, given that the rows are there when printing response.content?

Here is the code I am using:

from bs4 import BeautifulSoup
import requests

session = requests.Session()

url = 'http://www.datasheetcatalog.com/catalog/p1342320.shtml'
response = session.get(url)

soup = BeautifulSoup(response.content, "lxml")

trs = soup.findAll('table')[8].findAll('tr')
print(len(trs))

Upvotes: 1

Views: 81

Answers (2)

Lyesgigs

Reputation: 129

After a detailed examination of the HTML page, I found that BeautifulSoup stopped after hitting the malformed comments. So the solution is to change the parser from "lxml" to "html5lib":

soup = BeautifulSoup(response.content, "html5lib")
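
For completeness, a minimal end-to-end version of the fix (html5lib has to be installed separately, e.g. pip install html5lib):

from bs4 import BeautifulSoup
import requests

url = 'http://www.datasheetcatalog.com/catalog/p1342320.shtml'
response = requests.get(url)

# html5lib parses like a browser and recovers from the malformed comments
# instead of stopping at them
soup = BeautifulSoup(response.content, "html5lib")

trs = soup.findAll('table')[8].findAll('tr')
print(len(trs))  # should now report all 22 rows of the second table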

Upvotes: 1

ewwink

Reputation: 19154

The HTML is not valid, which breaks BeautifulSoup. Here is a fix:

import re

import requests
from bs4 import BeautifulSoup

response = requests.get('http://www.datasheetcatalog.com/catalog/p1342320.shtml')

# repair the malformed markup before parsing
html_doc = response.text.replace('<table <', '<')         # drop the stray '<table ' prefix in broken tags
html_doc = re.sub(r'<\!--\s+\d+\s+--\!>', '', html_doc)   # remove comments closed with "--!>" instead of "-->"
html_doc = re.sub(r'</?font.*?>', '', html_doc)           # strip the <font> tags
soup = BeautifulSoup(html_doc, "html.parser")

trs = soup.findAll('table')[8].findAll('tr')
print(len(trs))

Note: using lxml returns 7 rows, not 22.
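
If you want to see the parser difference directly, here is a small self-contained experiment with a comment terminated by "--!>" (the same kind this page uses); it needs lxml and html5lib installed, and the exact counts you get depend on the installed versions:

from bs4 import BeautifulSoup

# one row, a malformed comment ("--!>" instead of "-->"), then another row
html_doc = "<table><tr><td>1</td></tr><!-- 42 --!><tr><td>2</td></tr></table>"

for parser in ("html.parser", "lxml", "html5lib"):
    soup = BeautifulSoup(html_doc, parser)
    print(parser, len(soup.findAll('tr')))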

Upvotes: 0
