Reputation: 536
I'm working on a second house pricing project, so I need to scrape information from one of the largest second house trading platform in China. Here's my problem, the info on the page and the corresponding element using Chrome 'inspect' function are as follows:
my code:
>>>from lxml import etree
>>>import requests
>>>url = 'http://bj.lianjia.com/chengjiao/101101498110.html'
>>>r = requests.get(url)
>>>xiaoqu_avg_price = tree.xpath('//[@id="resblockCardContainer"]/div/div/div[2]/div/div[1]/span/text()')
>>>xiaoqu_avg_price
[]
The returned empty list is not desirable (ideally it should be 73648). Furthermore, I viewed its HTML source code, which shows:
So how should I do to get what I want? And what is the resblockCard means? Thanks.
Upvotes: 1
Views: 82
Reputation: 1549
This site like many others uses ajax for populating content. If you make a similar request you can get desired value in json format.
import requests
url = 'http://bj.lianjia.com/chengjiao/resblock?hid=101101498110&rid=1111027378082'
# Get json response
response = requests.get(url).json()
print(response['data']['resblock']['unitPrice'])
# 73648
Note two group of numbers in request url. The first group from original page url, second you can find under script
tag in the original page source: resblockId:'1111027378082'
.
Upvotes: 1
Reputation: 6302
That XPath query is not working as expected because you are running it against the source code of the page as it is served by the server, not as it looks on a rendered browser page.
One solution for this is to use Selenium in conjunction with PhantomJS or some other browser driver, which will run the JavaScript on that page and render it for you.
from selenium import webdriver
from lxml import html
driver = webdriver.PhantomJS(executable_path="<path to>/phantomjs.exe")
driver.get('http://bj.lianjia.com/chengjiao/101101498110.html')
source = driver.page_source
driver.close() # or quit() if there are no more pages to scrape
tree = html.fromstring(source)
price = tree.xpath('//div[@id="resblockCardContainer"]/div/div/div[2]/div/div[1]/span/text()')[0].strip()
The above returns 73648 元/㎡
.
Upvotes: 0