devon

Reputation: 321

HtmlResponse from Scrapy doesn't retrieve the data from URL

This is the code I ran in IPython:

from scrapy.selector import Selector
from scrapy.http import HtmlResponse

response = HtmlResponse(url='https://en.wikipedia.org/wiki/Pan_American_Games')
datas = Selector(response=response).xpath('//div[@class="thumb tleft"]')

When I evaluate response I get <200 https://en.wikipedia.org/wiki/Pan_American_Games>, but when I evaluate response.body I get '' (an empty string).

It seems that HtmlResponse doesn't retrieve any HTML for this page.

Does anyone know how to fix this?

FYI, if I run $ scrapy shell https://en.wikipedia.org/wiki/Pan_American_Games at the command prompt, then response is not empty. I don't want to use the scrapy shell url approach, since I will be running a for loop over a list of URLs.

Thanks

Upvotes: 1

Views: 714

Answers (2)

Granitosaurus

Reputation: 21406

The issue is that you are not writing a spider here. HtmlResponse does not retrieve anything from the internet; it is only a container for a response. What you have is a response object holding just the url you provided, which is why its body is empty.

Here is a great official depiction of Scrapy's architecture: http://doc.scrapy.org/en/latest/topics/architecture.html?highlight=scrapy%20architecture

However, if you do want to use Scrapy features like selectors without Scrapy spiders, you can use requests to retrieve the page and then continue with Scrapy selectors, item loaders, etc. This is not the recommended approach, though, since you would be missing out on the other features Scrapy has to offer.

Official Scrapy tutorial for beginners: http://doc.scrapy.org/en/latest/intro/tutorial.html

Upvotes: 1

Ixio

Reputation: 527

Are you sure that you want to use Scrapy for this? If you do, you should really follow the tutorial and use a Spider. I'm pretty sure this is not how Scrapy is meant to be used.

If you just want a basic scraper in Python 2, I suggest the following:

from urllib2 import urlopen
from lxml import html

response = urlopen('https://en.wikipedia.org/wiki/Pan_American_Games')
page = html.fromstring(response.read())
datas = page.xpath('//div[@class="thumb tleft"]')
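For Python 3, where urllib2 no longer exists, a comparable sketch might look like this. The function names, the User-Agent string, and the timeout are assumptions of mine, and the loop is there only because the question mentions iterating over a list of URLs.

```python
from urllib.request import Request, urlopen

from lxml import html


def extract_thumbs(markup):
    """Parse HTML bytes and return the matching <div> elements."""
    page = html.fromstring(markup)
    return page.xpath('//div[@class="thumb tleft"]')


def scrape(urls):
    """Fetch each URL in turn and map it to its matching elements."""
    results = {}
    for url in urls:
        # The User-Agent value is a placeholder; some sites reject the default one.
        request = Request(url, headers={'User-Agent': 'example-scraper/0.1'})
        with urlopen(request, timeout=10) as reply:
            results[url] = extract_thumbs(reply.read())
    return results
```

Calling scrape(['https://en.wikipedia.org/wiki/Pan_American_Games']) should then return a dict keyed by URL, assuming the page is reachable.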

Upvotes: 0
