Reputation: 321
This is the code I ran in IPython:
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
response = HtmlResponse(url='https://en.wikipedia.org/wiki/Pan_American_Games')
datas = Selector(response=response).xpath('//div[@class="thumb tleft"]')
When I evaluate response
I get <200 https://en.wikipedia.org/wiki/Pan_American_Games>
But when I evaluate response.body
I get ''
(NULL)
It seems like HtmlResponse
doesn't retrieve any HTML for this page.
Does anyone know how to fix this?
FYI, if I run $ scrapy shell https://en.wikipedia.org/wiki/Pan_American_Games
at the command prompt, then response isn't NULL.
I don't want to use the scrapy shell url
approach, since I will be looping over a list of URLs.
Thanks
Upvotes: 1
Views: 714
Reputation: 21406
The issue is that you are not writing a spider here. HtmlResponse
doesn't fetch any data from the internet; it is just a container. What you have is a response object holding only the url attribute you've provided.
Here is the official overview of Scrapy's architecture: http://doc.scrapy.org/en/latest/topics/architecture.html?highlight=scrapy%20architecture
However, if you do want to use Scrapy features like selectors without Scrapy spiders, you can use requests
to retrieve the page and continue with Scrapy selectors,
item loaders,
etc. This is not the recommended approach, though, since you would miss out on all of the features Scrapy has to offer.
The official Scrapy tutorial for beginners: http://doc.scrapy.org/en/latest/intro/tutorial.html
Upvotes: 1
Reputation: 527
Are you sure that you want to be using Scrapy for this? Because if you do you should really follow the tutorial and use a Spider. I'm pretty sure this is not the way to use Scrapy.
If you just want a basic scraper in Python 2, I suggest the following:
from urllib2 import urlopen  # Python 2 only; this moved to urllib.request in Python 3
from lxml import html

# Fetch the page and parse the raw HTML with lxml.
response = urlopen('https://en.wikipedia.org/wiki/Pan_American_Games')
page = html.fromstring(response.read())
# The same XPath expression works directly on the lxml tree.
datas = page.xpath('//div[@class="thumb tleft"]')
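For reference, the same sketch in Python 3, where urllib2 became urllib.request (assuming lxml is installed):

```python
from urllib.request import urlopen  # Python 3 equivalent of urllib2.urlopen
from lxml import html

# Fetch the page and parse the raw HTML with lxml.
response = urlopen('https://en.wikipedia.org/wiki/Pan_American_Games')
page = html.fromstring(response.read())
# The same XPath expression works directly on the lxml tree.
datas = page.xpath('//div[@class="thumb tleft"]')
print(len(datas))
```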
Upvotes: 0