brophi

Reputation: 33

Using Python requests.get to parse html code that does not load at once

I am trying to write a Python script that will periodically check a website to see if an item is available. I have used requests.get, lxml.html, and xpath successfully in the past to automate website searches. In the case of this particular URL (http://www.anthropologie.com/anthro/product/4120200892474.jsp?cm_vc=SEARCH_RESULTS#/) and others on the same website, my code was not working.

import requests
from lxml import html
page = requests.get("http://www.anthropologie.com/anthro/product/4120200892474.jsp?cm_vc=SEARCH_RESULTS#/")
tree = html.fromstring(page.text)
html_element = tree.xpath(".//div[@class='product-soldout ng-scope']")

At this point, html_element should be a list of elements (I think in this case only 1), but instead it is empty. I think this is because the website is not loading all at once, so when requests.get() goes out and grabs it, it only grabs the first part. So my questions are: 1) Am I correct in my assessment of the problem? 2) If so, is there a way to make requests.get() wait before returning the HTML, or perhaps another route entirely to get the whole page?

Thanks

Edit: Thanks to both responses. I used Selenium and got my script working.

Upvotes: 3

Views: 12961

Answers (2)

Padraic Cunningham

Reputation: 180401

The page uses JavaScript to load the table, which is not there yet when requests fetches the HTML, so you are getting all the HTML, just not the parts that are generated by JavaScript. You could use Selenium combined with PhantomJS for headless browsing to get the rendered HTML:

from selenium import webdriver

browser = webdriver.PhantomJS()  # headless browser; requires PhantomJS to be installed and on the PATH
browser.get("http://www.anthropologie.eu/anthro/index.jsp#/")
html = browser.page_source  # the HTML after the JavaScript has run
print(html)
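
Once the rendered source is in hand, it can be fed straight back into the lxml/XPath code from the question. A minimal sketch that continues from the snippet above (the class name is copied from the question, and the fixed sleep is an assumption to give the client-side JavaScript time to finish; an explicit wait would be more robust):

import time
from lxml import html as lxml_html  # renamed so it doesn't clobber the html string above

browser.get("http://www.anthropologie.com/anthro/product/4120200892474.jsp?cm_vc=SEARCH_RESULTS#/")
time.sleep(5)  # crude wait for the client-side rendering to finish
tree = lxml_html.fromstring(browser.page_source)
# same XPath as in the question, now run against the rendered DOM
soldout_divs = tree.xpath(".//div[@class='product-soldout ng-scope']")
print(len(soldout_divs))
browser.quit()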

Upvotes: 3

abarnert

Reputation: 365707

You are not correct in your assessment of the problem.

You can check the results and see that there's a </html> right near the end. That means you've got the whole page.
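
A quick way to verify that, reusing the page object from the question (a small sketch, not part of the original code):

print(page.text.rstrip()[-100:])  # the tail of the response; the closing </html> tag should appear here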

And the response's .text always gives you the whole page; if you want to stream it a bit at a time, you have to do so explicitly.
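
For comparison, explicit streaming with requests looks roughly like this (a sketch; the chunk size is arbitrary):

import requests

resp = requests.get("http://www.anthropologie.com/anthro/product/4120200892474.jsp?cm_vc=SEARCH_RESULTS#/", stream=True)
for chunk in resp.iter_content(chunk_size=8192):  # the body arrives a piece at a time
    print(len(chunk))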

Your problem is that the table doesn't actually exist in the HTML; it's built dynamically by client-side JavaScript. You can see that by actually reading the HTML that's returned. So, unless you run that JavaScript, you don't have the information.

There are a number of general solutions to that. For example:

  • Use selenium or similar to drive an actual browser to download the page.
  • Manually work out what the JavaScript code does and do equivalent work in Python (see the sketch after this list).
  • Run a headless JavaScript interpreter against a DOM that you've built up.
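
For the second option, the usual workflow is to open the browser's network inspector, find the JSON/XHR request the page's JavaScript makes, and call that endpoint directly with requests. The endpoint and field name below are purely illustrative placeholders, not the site's real API:

import requests

# hypothetical endpoint discovered via the network inspector (an assumption, not the site's real URL)
api_url = "http://www.example.com/api/product/4120200892474/availability"
data = requests.get(api_url).json()
print(data.get("soldOut"))  # field name is also hypothetical; inspect the real response for the right key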

Upvotes: 9
