Reputation: 59
I've run this web scraping exercise using the requests and BeautifulSoup modules in Python 2.7.12. My problem is that soup.find() won't return a specific tr by its id, nor a few other HTML elements with ids that I picked at random, including the ones in the print statements below. Any idea why that's not working? Any help would be greatly appreciated.
import requests
from bs4 import BeautifulSoup as bs
head= {
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36',
'Content-Type': 'text/html',}
r = requests.get('http://www.iii.co.uk/investment/detail?code=cotn:LSE:SEE&display=discussion', headers=head)
r_text = r.text
soup = bs(r_text, "html.parser")
print soup.find("tr",id="disc1-12056888")
print soup.find('table', id='discussion-list')
Upvotes: 0
Views: 1154
Reputation: 12178
I believe html.parser is unreliable in Python 2; use lxml or html5lib instead:
soup = bs(r_text, "lxml")
This quote is from the Beautiful Soup documentation:
If you can, I recommend you install and use lxml for speed. If you’re using a version of Python 2 earlier than 2.7.3, or a version of Python 3 earlier than 3.2.2, it’s essential that you install lxml or html5lib–Python’s built-in HTML parser is just not very good in older versions.
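To check whether the parser really is the culprit, one option is to run the same lookup through every parser you have installed and compare the results. The following is only a rough sketch: SAMPLE_HTML is a made-up stand-in for the downloaded page (it is not the real markup from the site), and find_with_each_parser is a hypothetical helper name, not part of Beautiful Soup. Parsers that are not installed are skipped.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Assumption: a tiny hand-written stand-in for the real page, reusing the
# ids from the question. Replace it with r.text to test the actual response.
SAMPLE_HTML = """
<table id="discussion-list">
  <tr id="disc1-12056888"><td>First comment</td></tr>
  <tr id="disc1-12056889"><td>Second comment</td></tr>
</table>
"""


def find_with_each_parser(html, tag, element_id,
                          parsers=("lxml", "html5lib", "html.parser")):
    """Return {parser_name: bool} for every parser that is installed."""
    results = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            # bs4 raises FeatureNotFound when the requested parser is missing.
            continue
        results[name] = soup.find(tag, id=element_id) is not None
    return results


results = find_with_each_parser(SAMPLE_HTML, "tr", "disc1-12056888")
for name, found in sorted(results.items()):
    print("%s: %s" % (name, "found" if found else "not found"))
```

If one parser finds the row on the real response and another does not, the parser is the problem. If none of them find it, the element is probably not in the HTML that requests received in the first place.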
Upvotes: 2
Reputation: 3279
@AndrewF: I'd suggest using PyQuery for simpler tasks such as extracting comments. Here is a snippet to show how simple it is:
import requests
import pyquery
head= {
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36',
'Content-Type': 'text/html',}
r = requests.get('http://www.iii.co.uk/investment/detail?code=cotn:LSE:SEE&display=discussion', headers=head)
r_text = r.text
pq = pyquery.PyQuery(r_text)
for a in pq('tr.comment div'):
    # a.text is None when a div has no leading text, so guard before strip()
    text = (a.text or '').strip()
    if text:
        print(text)
Upvotes: 1