Reputation: 15
When I use XPath to crawl and parse the content of Tencent commonweal, all the returned lists are empty. My code is below (the headers information is hidden), and the target URL is https://gongyi.qq.com/succor/project_list.htm#s_tid=75. I would appreciate it if someone could help me solve this problem.
import requests
import os
from lxml import etree

if __name__ == '__main__':
    url = 'https://gongyi.qq.com/succor/project_list.htm#s_tid=75'
    headers = {
        'User-Agent': XXX  # actual User-Agent string hidden
    }
    response = requests.get(url=url, headers=headers)
    page_text = response.text
    tree = etree.HTML(page_text)
    li_list = tree.xpath('//div[@class="pro_main"]//li')
    for li in li_list:
        title = li.xpath('./div[2]/div/a/text()')[0]
        print(title)
Upvotes: 1
Views: 977
Reputation: 2304
What is actually happening here is that you can only access the first ul inside the pro_main div: all those li items and their parent are populated by JavaScript, so they are not yet in the HTML that requests.get() returns, and your XPath result comes back empty!
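You can confirm this with a quick check (a minimal sketch, reusing the same URL and XPath from the question): the static HTML that requests downloads contains none, or almost none, of the project li nodes.

import requests
from lxml import etree

# Fetch the page the same way the question does and count the matched nodes.
# Because the project list is injected by JavaScript after the page loads,
# this count stays at 0 (or only a few static placeholders), not the full list.
url = 'https://gongyi.qq.com/succor/project_list.htm#s_tid=75'
html = requests.get(url).text
tree = etree.HTML(html)
print(len(tree.xpath('//div[@class="pro_main"]//li')))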
The good news is that the JS script in question populates the data through an API, so you can retrieve those titles the same way the website does: call the API directly and print them.
import requests
import json

if __name__ == '__main__':
    url = 'https://ssl.gongyi.qq.com/cgi-bin/WXSearchCGI?ptype=stat&s_status=1&s_tid=75'
    resp = requests.get(url).text
    resp = resp[1:-1]  # The result is wrapped in (), so we get rid of those
    jj = json.loads(resp)
    for i in jj["plist"]:
        title = i["title"]
        print(title)
You can explore the API by printing jj to see if there's more info that you may need later!
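For instance, a minimal sketch of how you could dump the whole payload and list the fields of one project entry (same API URL as above; any fields beyond "plist" and "title" are simply whatever the API happens to return):

import requests
import json

url = 'https://ssl.gongyi.qq.com/cgi-bin/WXSearchCGI?ptype=stat&s_status=1&s_tid=75'
raw = requests.get(url).text
jj = json.loads(raw[1:-1])  # strip the wrapping () as before

# Pretty-print the full payload; ensure_ascii=False keeps the Chinese titles readable
print(json.dumps(jj, ensure_ascii=False, indent=2))

# Fields available on a single project entry ("title" plus whatever else the API exposes)
if jj.get("plist"):
    print(list(jj["plist"][0].keys()))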
Let me know if it works for you!
Upvotes: 1