Reputation: 322
I am using Scrapy to get abstracts from OpenReview URLs. For example, I want to get the abstract from http://openreview.net/forum?id=Bk0FWVcgx, and upon doing
$ scrapy shell "http://openreview.net/forum?id=Bk0FWVcgx"
>>> response.xpath('//span[@class="note_content_value"]').extract()
I get back []. In addition, when I do view(response) I am led to a blank page at file:///var/folders/1j/_gkykr316td7f26fv1775c3w0000gn/T/tmpBehKh8.html.
Further, inspecting the OpenReview webpage shows me there are script elements, which I've never seen before. When I call
response.xpath('//script').extract()
I get back items like u'<script src="static/libs/search.js"></script>', for example.
I've read a little bit about this having something to do with JavaScript, but I'm a beginner with Scrapy and unsure how to get around it and retrieve what I want.
Upvotes: 0
Views: 1272
Reputation: 143097
I found that the page uses JavaScript/AJAX to load all of its information from the address
http://openreview.net/notes?forum=Bk0FWVcgx&trash=true
But it needs two cookies to access this information. First, the server sends the cookie GCLB. Later, the page loads http://openreview.net/token and gets the second cookie, openreview:sid. After that, the page can load the JSON data.
Here is a working example with requests:
import requests
s = requests.Session()
# to get `GCLB` cookie
r = s.get('http://openreview.net/forum?id=Bk0FWVcgx')
print(r.cookies)
# to get `openreview:sid` cookie
r = s.get('http://openreview.net/token')
print(r.cookies)
# to get JSON data
r = s.get('http://openreview.net/notes?forum=Bk0FWVcgx&trash=true')
data = r.json()
print(data['notes'][0]['content']['title'])
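Since the question asks for the abstract rather than the title, you can pull it from the same JSON structure. A minimal sketch, assuming the content dict also carries an 'abstract' key (the field name is an assumption; the sample payload below is made up to mirror the data['notes'][0]['content'] access used above):

```python
import json

# Hypothetical sample of the JSON returned by the /notes endpoint.
# The 'abstract' key is an assumption based on the question's goal;
# inspect the real response to confirm the exact field names.
sample = json.loads("""
{
  "notes": [
    {
      "content": {
        "title": "Example Paper Title",
        "abstract": "This is the abstract text."
      }
    }
  ]
}
""")

# The paper's fields live under the first entry of "notes".
note = sample['notes'][0]['content']
print(note['title'])     # Example Paper Title
print(note['abstract'])  # This is the abstract text.
```

With the real session from the example above, you would replace `sample` with `r.json()` and read the same keys.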
Other solution: use Selenium or another tool that can run JavaScript, and then you can get the full HTML with all the information. Scrapy can probably be combined with Selenium or PhantomJS to run JavaScript, but I have never tried that with Scrapy.
Upvotes: 1