Reputation: 21730
I'm writing a page scraper using Beautiful Soup, and I've noticed that it sometimes tries to parse a page before the page has completely loaded.
What I'm doing is something like this:
soup = BeautifulSoup(urllib.urlopen(page))
I'm not very good with Python, but I think there must be a way for me to know that the page has finished loading, so I can start scraping it.
The reason I think it's not waiting until everything has loaded is that the script works most of the time, but sometimes fails with an error saying the element I'm looking for isn't on the page (yet).
Could anyone give me a hand with this?
Upvotes: 1
Views: 1713
Reputation: 318558
Try reading everything into a string:
html = urllib.urlopen(page).read()
soup = BeautifulSoup(html)
While the BS docs say passing an open file object is fine, it is worth trying it this way to rule out a problem with how the response is being read.
If it still fails, the problem is not related to BS at all. In that case, print html to see what you actually receive. Maybe you are simply not logged in to the site when accessing it from your Python script, or something similar.
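As a rough sketch of that debugging step: the snippet below reads the whole response into a string before any parsing, then checks whether the text you expect is in it at all. It is written for Python 3 (urllib.request instead of the Python 2 urllib.urlopen used above), and the names fetch_html and contains_marker are made up for illustration, not part of urllib or Beautiful Soup.

```python
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Read the complete response body and return it as text."""
    with urlopen(url, timeout=timeout) as response:
        raw = response.read()  # blocks until the whole body has been received
        # Fall back to UTF-8 when the server does not declare a charset (assumption).
        charset = response.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")


def contains_marker(html, marker):
    """Return True if the expected text (e.g. an id or class name) is in the raw HTML."""
    return marker.lower() in html.lower()
```

If contains_marker(fetch_html(page), "some-id") comes back False on the failing runs, the element was never in the HTML the server sent, so the parser is not the culprit. Only then pass the string on to BeautifulSoup.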
Upvotes: 2
Reputation: 7320
Is it possible that the page you're trying to load uses JavaScript? urllib only downloads the raw HTML and does not run any scripts, so content that is generated by JavaScript will be missing from what you parse. If it's just a plain static webpage, .urlopen() should do fine. If JavaScript is indeed the problem, you can try something like PyQt4 to load the page and then extract the HTML, or use a browser automation tool like Selenium or Windmill.
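Before reaching for a full browser, a quick check along these lines can hint at whether JavaScript is involved. This is only a heuristic sketch using the standard library's html.parser (Python 3); the diagnose function and its return values are invented for this example, and a page having script tags does not prove the missing element is built by them.

```python
from html.parser import HTMLParser


class _PageScan(HTMLParser):
    """Collect every id attribute and count <script> tags in a document."""

    def __init__(self):
        super().__init__()
        self.ids = set()
        self.script_count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_count += 1
        for name, value in attrs:
            if name == "id" and value:
                self.ids.add(value)


def diagnose(html, element_id):
    """Classify why an element might be absent from the downloaded HTML."""
    scan = _PageScan()
    scan.feed(html)
    scan.close()
    if element_id in scan.ids:
        return "present"              # the element is in the static HTML
    if scan.script_count:
        return "possibly-javascript"  # scripts exist; the element may be added at runtime
    return "missing"                  # no scripts either; likely a different page was served
```

A "possibly-javascript" result on the failing runs would be a reason to try the PyQt4 or Selenium route; "missing" points more towards a login page, an error page, or a redirect.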
Upvotes: 2