Jiahui Zhang

Reputation: 536

Failing to retrieve text from HTML using lxml and XPath

I'm working on a second-hand housing price project, so I need to scrape information from one of the largest second-hand housing trading platforms in China. Here is my problem. The information on the page and the corresponding element, as shown by Chrome's 'Inspect' function, are as follows:

[screenshot: the price on the page and the corresponding element in Chrome's inspector]

my code:

>>> from lxml import etree
>>> import requests
>>> url = 'http://bj.lianjia.com/chengjiao/101101498110.html'
>>> r = requests.get(url)
>>> tree = etree.HTML(r.text)
>>> xiaoqu_avg_price = tree.xpath('//*[@id="resblockCardContainer"]/div/div/div[2]/div/div[1]/span/text()')
>>> xiaoqu_avg_price
[]

The returned empty list is not what I want (ideally it should be 73648). I also viewed the page's HTML source, which shows:

[screenshot: the page's HTML source]

So what should I do to get the value I want? And what does resblockCard mean? Thanks.

Upvotes: 1

Views: 82

Answers (2)

vold

Reputation: 1549

This site, like many others, uses AJAX to populate its content. If you make the same request the page makes, you can get the desired value in JSON format.

import requests

url = 'http://bj.lianjia.com/chengjiao/resblock?hid=101101498110&rid=1111027378082'
# Get json response
response = requests.get(url).json()
print(response['data']['resblock']['unitPrice'])
# 73648

Note the two groups of digits in the request URL. The first comes from the original page URL; the second can be found under a script tag in the original page source: resblockId:'1111027378082'.
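
If you need to do this for many pages, both ids can be pulled out programmatically. The following is only a rough sketch: the helper names are made up for illustration, and sample_source is a hypothetical stand-in for the real page source (only the resblockId:'...' fragment is taken from the actual page). In practice you would pass in requests.get(page_url).text instead.

```python
import re

def extract_house_id(page_url):
    """Return the digits before '.html' in a chengjiao page URL, or None."""
    match = re.search(r"/chengjiao/(\d+)\.html", page_url)
    return match.group(1) if match else None

def extract_resblock_id(page_source):
    """Return the digits of the first resblockId:'...' entry, or None if absent."""
    match = re.search(r"resblockId\s*:\s*['\"](\d+)['\"]", page_source)
    return match.group(1) if match else None

def build_resblock_url(house_id, resblock_id):
    """Assemble the ajax URL from the two ids, following the pattern shown above."""
    return ('http://bj.lianjia.com/chengjiao/resblock'
            '?hid={}&rid={}'.format(house_id, resblock_id))

page_url = 'http://bj.lianjia.com/chengjiao/101101498110.html'
# Hypothetical stand-in for the downloaded page source; the real script tag
# contains much more than this.
sample_source = "<script>init({resblockId:'1111027378082'});</script>"

hid = extract_house_id(page_url)
rid = extract_resblock_id(sample_source)
ajax_url = build_resblock_url(hid, rid)
print(ajax_url)
```

Both extractors return None when the pattern is missing, so check for that before building the URL on pages whose layout may differ.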

Upvotes: 1

Martin Valgur

Reputation: 6302

That XPath query is not working as expected because you are running it against the source code of the page as it is served by the server, not as it looks on a rendered browser page.
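
To illustrate the difference, here is a minimal sketch using hypothetical, heavily simplified markup (the real page is far larger and nested differently) and the standard library's xml.etree.ElementTree as a stand-in for lxml. The same path finds nothing in the markup as served and finds the price once the container has been filled in:

```python
import xml.etree.ElementTree as ET

# Hypothetical markup as the server sends it: the container div is empty.
served = '<html><body><div id="resblockCardContainer"></div></body></html>'
# Hypothetical markup after JavaScript has populated the container.
rendered = (
    '<html><body><div id="resblockCardContainer">'
    '<div><span>73648</span></div>'
    '</div></body></html>'
)

def span_texts(markup):
    """Return the text of each span one div below the container."""
    root = ET.fromstring(markup)
    path = ".//div[@id='resblockCardContainer']/div/span"
    return [span.text for span in root.findall(path)]

served_result = span_texts(served)
rendered_result = span_texts(rendered)
print(served_result)
print(rendered_result)
```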

One solution for this is to use Selenium in conjunction with PhantomJS or some other browser driver, which will run the JavaScript on that page and render it for you.

from selenium import webdriver
from lxml import html

driver = webdriver.PhantomJS(executable_path="<path to>/phantomjs.exe")
driver.get('http://bj.lianjia.com/chengjiao/101101498110.html')
source = driver.page_source
driver.close()  # or quit() if there are no more pages to scrape

tree = html.fromstring(source)
price = tree.xpath('//div[@id="resblockCardContainer"]/div/div/div[2]/div/div[1]/span/text()')[0].strip()

The above returns 73648 元/㎡.

Upvotes: 0
