Marcos Placona

Reputation: 21730

Check if Python urlopen has finished loading

I'm writing a page scraper using Beautiful Soup, and I've noticed that it sometimes seems to parse a page before the page has completely loaded.

What I'm doing is something like this:

soup = BeautifulSoup(urllib.urlopen(page))

I'm not very good with Python, but I think there must be a way to know that the page has finished loading so that I can start scraping it.

The reason I think it isn't waiting until everything has loaded is that the script works most of the time, but occasionally fails with an error saying the element I'm looking for isn't on the page (yet).

Could anyone give me a hand with this?

Upvotes: 1

Views: 1713

Answers (2)

ThiefMaster

Reputation: 318558

Try reading everything into a string:

html = urllib.urlopen(page).read()
soup = BeautifulSoup(html)

The BS docs say passing an open file object is fine, but reading the response into a string first is a good way to rule BS out. If it still fails, the problem isn't related to BS at all. In that case, print html to see what you actually receive. Maybe the site is sending you a different page because you are not logged in when accessing it from your Python script, or something similar.
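To make that check repeatable, here is a minimal sketch. It assumes Python 3, where the call is urllib.request.urlopen rather than urllib.urlopen, and the names fetch_html and contains_marker are made up for illustration. It reads the whole body into a string and lets you test whether the text you expect is in there before handing it to BS.

```python
import urllib.request


def fetch_html(url, timeout=30):
    """Download the whole response body and return it as text.

    Hypothetical helper: the name and the timeout default are assumptions.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # read() with no argument keeps reading until the end of the body
        raw = response.read()
        # Fall back to UTF-8 when the server does not declare a charset
        charset = response.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")


def contains_marker(html, marker):
    """Return True if the raw HTML contains the text we expect to scrape."""
    return marker in html
```

If contains_marker(html, 'id="price"') is False on a failing run (the marker here is only an example), the element was never in the response, so the parser is not the culprit and it's worth printing html to see what the server sent.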

Upvotes: 2

adelbertc

Reputation: 7320

Is it possible there is some JavaScript in the page you're trying to load? urlopen() only downloads the HTML and never runs scripts, so anything the page adds with JavaScript won't be in what you get back. If it's just a plain static webpage, urlopen() should do fine. If JavaScript is indeed the problem, you can try something like PyQt4 to load the page and then extract the HTML, or drive a real browser with a tool like Selenium or Windmill.
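A rough way to test the JavaScript theory before reaching for a browser is to scan the raw HTML you downloaded. This is only a heuristic sketch using the standard library's html.parser in Python 3. The diagnose function, its return strings, and the idea of looking the element up by id are all assumptions for illustration.

```python
from html.parser import HTMLParser


class _IdAndScriptScanner(HTMLParser):
    """Record whether a given id appears and how many <script> tags exist."""

    def __init__(self, wanted_id):
        super().__init__()
        self.wanted_id = wanted_id
        self.found_id = False
        self.script_count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_count += 1
        # attrs is a list of (name, value) pairs with lowercased names
        if dict(attrs).get("id") == self.wanted_id:
            self.found_id = True


def diagnose(html, wanted_id):
    """Hypothetical helper: guess why an element might be missing."""
    scanner = _IdAndScriptScanner(wanted_id)
    scanner.feed(html)
    scanner.close()
    if scanner.found_id:
        return "present"
    if scanner.script_count:
        # The element is absent but scripts exist: it may be added at runtime
        return "maybe-javascript"
    return "absent"
```

A result of "maybe-javascript" doesn't prove anything, it just means the element isn't in the static HTML while the page does load scripts, which is when a tool like Selenium is worth trying.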

Upvotes: 2
