boldbrandywine

Reputation: 322

Scrapy returning empty list for xpath

I am using Scrapy to get abstracts from openreview urls. For example, I want to get the abstract from http://openreview.net/forum?id=Bk0FWVcgx, and upon doing

$ scrapy shell "http://openreview.net/forum?id=Bk0FWVcgx"
>>> response.xpath('//span[@class="note_content_value"]').extract()

I get back []. In addition, when I do view(response) I am led to a blank page at file:///var/folders/1j/_gkykr316td7f26fv1775c3w0000gn/T/tmpBehKh8.html.

Further, inspecting the openreview webpage shows me there are script elements, which I've never seen before. When I call

response.xpath('//script').extract()

I get things back like u'<script src="static/libs/search.js"></script>', for example.

I've read a little bit about this having something to do with javascript, but I'm kind of a beginner with Scrapy and unsure how to bypass this and get what I want.

Upvotes: 0

Views: 1272

Answers (1)

furas

Reputation: 143097

I found that the page uses JavaScript/AJAX to load all of its information from this address:
http://openreview.net/notes?forum=Bk0FWVcgx&trash=true

But it needs two cookies to access this information. First, the server sends the GCLB cookie. Then the page loads http://openreview.net/token and gets the second cookie, openreview:sid. After that, the page can load the JSON data.

Here is a working example with requests:

import requests

s = requests.Session()

# to get `GCLB` cookie
r = s.get('http://openreview.net/forum?id=Bk0FWVcgx')
print(r.cookies)

# to get `openreview:sid` cookie
r = s.get('http://openreview.net/token')
print(r.cookies)

# to get JSON data
r = s.get('http://openreview.net/notes?forum=Bk0FWVcgx&trash=true')
data = r.json()
print(data['notes'][0]['content']['title'])
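
The example above prints the title, while the question asks for the abstract. Here is a minimal sketch of pulling one field out of every note in the JSON. It assumes the abstract is stored under content['abstract'], next to 'title', which I have not checked against the real response. The extract_field helper and the sample payload are made up for illustration.

```python
def extract_field(data, field="abstract"):
    """Return the given content field from every note that has it."""
    values = []
    for note in data.get("notes", []):
        content = note.get("content", {})
        # Not every note has to carry the field (e.g. comments or reviews)
        if field in content:
            values.append(content[field])
    return values


# Hypothetical payload shaped like the JSON used above (not real data)
sample = {
    "notes": [
        {"content": {"title": "Example paper", "abstract": "Example abstract."}},
        {"content": {"title": "A review comment"}},
    ]
}

print(extract_field(sample))
print(extract_field(sample, "title"))
```

With the real response you would call extract_field(data) after data = r.json(). If the result is empty, print data['notes'][0]['content'].keys() to see which keys the server actually sends.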

Another solution: use Selenium or another tool to run the JavaScript code, and then you can get the full HTML with all the information. Scrapy can probably use Selenium or PhantomJS to run JavaScript, but I have never tried it with Scrapy.

Upvotes: 1
