brophi

Reputation: 33

Using Python requests.get to parse html code that does not load at once

I am trying to write a Python script that will periodically check a website to see if an item is available. I have used requests.get, lxml.html, and xpath successfully in the past to automate website searches. In the case of this particular URL (http://www.anthropologie.com/anthro/product/4120200892474.jsp?cm_vc=SEARCH_RESULTS#/) and others on the same website, my code was not working.

import requests
from lxml import html
page = requests.get("http://www.anthropologie.com/anthro/product/4120200892474.jsp?cm_vc=SEARCH_RESULTS#/")
tree = html.fromstring(page.text)
html_element = tree.xpath(".//div[@class='product-soldout ng-scope']")

At this point, html_element should be a list of elements (I think in this case only 1), but instead it is empty. I think this is because the website is not loading all at once, so when requests.get() goes out and grabs it, it only grabs the first part. So my questions are: 1) Am I correct in my assessment of the problem? 2) If so, is there a way to make requests.get() wait before returning the HTML, or perhaps another route entirely to get the whole page?

Thanks

Edit: Thanks to both responses. I used Selenium and got my script working.

Upvotes: 3

Views: 12961

Answers (2)

Padraic Cunningham

Reputation: 180401

The page uses JavaScript to load the table, which is not there yet when requests fetches the HTML, so you are getting all the HTML, just not the parts that are generated by JavaScript. You could use Selenium combined with PhantomJS for headless browsing to get the rendered HTML:

from selenium import webdriver

browser = webdriver.PhantomJS()  # headless browser; requires PhantomJS to be installed and on the PATH
browser.get("http://www.anthropologie.eu/anthro/index.jsp#/")
html = browser.page_source  # the HTML after the JavaScript has run
print(html)
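
Once the rendered source is in hand, it can be fed straight back into the lxml/XPath code from the question. A minimal sketch that continues from the snippet above (the class name is copied from the question, and the fixed sleep is an assumption to give the client-side JavaScript time to finish; an explicit wait would be more robust):

import time
from lxml import html as lxml_html  # renamed so it doesn't clobber the html string above

browser.get("http://www.anthropologie.com/anthro/product/4120200892474.jsp?cm_vc=SEARCH_RESULTS#/")
time.sleep(5)  # crude wait for the client-side rendering to finish
tree = lxml_html.fromstring(browser.page_source)
# same XPath as in the question, now run against the rendered DOM
soldout_divs = tree.xpath(".//div[@class='product-soldout ng-scope']")
print(len(soldout_divs))
browser.quit()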

Upvotes: 3

abarnert

Reputation: 365707

You are not correct in your assessment of the problem.

You can check the results and see that there's a </html> right near the end. That means you've got the whole page.
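
A quick way to verify that, reusing the page object from the question (a small sketch, not part of the original code):

print(page.text.rstrip()[-100:])  # the tail of the response; the closing </html> tag should appear here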

And the response's .text always gives you the whole page; if you want to stream it a bit at a time, you have to do so explicitly.
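
For comparison, explicit streaming with requests looks roughly like this (a sketch; the chunk size is arbitrary):

import requests

resp = requests.get("http://www.anthropologie.com/anthro/product/4120200892474.jsp?cm_vc=SEARCH_RESULTS#/", stream=True)
for chunk in resp.iter_content(chunk_size=8192):  # the body arrives a piece at a time
    print(len(chunk))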

Your problem is that the table doesn't actually exist in the HTML; it's built dynamically by client-side JavaScript. You can see that by actually reading the HTML that's returned. So, unless you run that JavaScript, you don't have the information.

There are a number of general solutions to that. For example:

  • Use selenium or similar to drive an actual browser to download the page.
  • Manually work out what the JavaScript code does and do equivalent work in Python (see the sketch after this list).
  • Run a headless JavaScript interpreter against a DOM that you've built up.
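
For the second option, the usual workflow is to open the browser's network inspector, find the JSON/XHR request the page's JavaScript makes, and call that endpoint directly with requests. The endpoint and field name below are purely illustrative placeholders, not the site's real API:

import requests

# hypothetical endpoint discovered via the network inspector (an assumption, not the site's real URL)
api_url = "http://www.example.com/api/product/4120200892474/availability"
data = requests.get(api_url).json()
print(data.get("soldOut"))  # field name is also hypothetical; inspect the real response for the right key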

Upvotes: 9
