Reputation: 1885
I am currently trying to make a simple crawler in Python using Scrapy. What I want it to do is read a list of links and save the HTML of the pages they link to. Right now I am able to get all of the URLs, but I can't figure out how to download the pages. Here is the code for my spider so far:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from tutorial.items import BookItem

# Book Scrapy spider
class DmozSpider(BaseSpider):
    name = "book"
    allowed_domains = ["learnpythonthehardway.org"]
    start_urls = [
        "http://www.learnpythonthehardway.org/book/",
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        file = open(filename, 'wb')
        file.write(response.body)
        file.close()

        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li')
        items = []
        for site in sites:
            item = BookItem()
            item['title'] = site.select('a/text()').extract()
            item['link'] = site.select('a/@href').extract()
            items.append(item)
        return items
Upvotes: 1
Views: 1627
Reputation: 179687
In your parse method, return Request objects in the list of returned items to trigger downloads:
for site in sites:
    ...
    items.append(item)
    items.append(Request(item['link'][0], callback=self.parse))
This will cause the crawler to produce a BookItem for each link, but also recurse and download each book's page. Of course, you can specify a different callback (e.g. self.parsebook) if you want to parse the subpages differently.
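For reference, here is a minimal sketch of how the two callbacks could fit together, assuming the same BaseSpider/HtmlXPathSelector API used in the question; parse_book is a hypothetical callback name, Request comes from scrapy.http, and urljoin is used on the assumption that the extracted hrefs are relative to the start URL:

from urlparse import urljoin  # Python 2, matching the question's Scrapy version

from scrapy.http import Request
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from tutorial.items import BookItem

class DmozSpider(BaseSpider):
    name = "book"
    allowed_domains = ["learnpythonthehardway.org"]
    start_urls = ["http://www.learnpythonthehardway.org/book/"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        for site in hxs.select('//ul/li'):
            item = BookItem()
            item['title'] = site.select('a/text()').extract()
            item['link'] = site.select('a/@href').extract()
            yield item
            if item['link']:
                # the hrefs are assumed relative, so build an absolute URL first
                url = urljoin(response.url, item['link'][0])
                yield Request(url, callback=self.parse_book)

    def parse_book(self, response):
        # hypothetical callback: save the raw HTML of each linked page
        filename = response.url.split("/")[-1] or 'index.html'
        with open(filename, 'wb') as f:
            f.write(response.body)

Yielding both items and Requests from parse is equivalent to appending them all to a returned list; Scrapy follows each Request and passes the downloaded response to the given callback.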
Upvotes: 1