Reputation: 147
I'm trying to learn Python and Scrapy, but I'm having problems with CrawlSpider.
The code below works for me. It takes all the links in the start URL that match the XPath //div[@class="info"]/h3/a/@href and passes those links to the function parse_dir_contents.
What I need now is to get the crawler to move to the next page. I tried to use rules and LinkExtractor, but I can't seem to make it work properly. I also tried using //a/@href as the XPath for the parse function, but then it wouldn't pass the links to the parse_dir_contents function. I think I'm missing something REALLY simple. Any ideas?
class ypSpider(CrawlSpider):
    name = "ypTest"
    download_delay = 2
    allowed_domains = ["yellowpages.com"]
    start_urls = ["http://www.yellowpages.com/new-york-ny/restaurants?page=1"]

    rules = [
        Rule(LinkExtractor(allow=['restaurants?page=[1-2]']), callback="parse")
    ]

    def parse(self, response):
        for href in response.xpath('//div[@class="info"]/h3/a/@href'):
            url = response.urljoin(href.extract())
            if 'mip' in url:
                yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        for sel in response.xpath('//div[@id="mip"]'):
            item = ypItem()
            item['url'] = response.url
            item['business'] = sel.xpath('//div/div/h1/text()').extract()
            ---extra items here---
            yield item
Edit: Here's the updated code with three functions; it's able to scrape 150 items. I think the problem is with my rules, but I've tried everything I thought might work and still get the same output.
class ypSpider(CrawlSpider):
    name = "ypTest"
    download_delay = 2
    allowed_domains = ["yellowpages.com"]
    start_urls = ["http://www.yellowpages.com/new-york-ny/restaurants?page=1"]

    rules = [
        Rule(LinkExtractor(allow=[r'restaurants\?page\=[1-2]']), callback='parse')
    ]

    def parse(self, response):
        for href in response.xpath('//a/@href'):
            url = response.urljoin(href.extract())
            if 'restaurants?page=' in url:
                yield scrapy.Request(url, callback=self.parse_links)

    def parse_links(self, response):
        for href in response.xpath('//div[@class="info"]/h3/a/@href'):
            url = response.urljoin(href.extract())
            if 'mip' in url:
                yield scrapy.Request(url, callback=self.parse_page)

    def parse_page(self, response):
        for sel in response.xpath('//div[@id="mip"]'):
            item = ypItem()
            item['url'] = response.url
            item['business'] = sel.xpath('//div/div/h1/text()').extract()
            item['phone'] = sel.xpath('//div/div/section/div/div[2]/p[3]/text()').extract()
            item['street'] = sel.xpath('//div/div/section/div/div[2]/p[1]/text()').re(r'(.+)\,')
            item['city'] = sel.xpath('//div/div/section/div/div[2]/p[2]/text()').re(r'(.+)\,')
            item['state'] = sel.xpath('//div/div/section/div/div[2]/p[2]/text()').re(r'\,\s(.+)\s\d')
            item['zip'] = sel.xpath('//div/div/section/div/div[2]/p[2]/text()').re(r'(\d+)')
            item['category'] = sel.xpath('//dd[@class="categories"]/span/a/text()').extract()
            yield item
Upvotes: 0
Views: 357
Reputation: 1019
I know this answer comes very late, but I managed to solve the problem and am posting my answer because it might be helpful for someone like me who was confused about how to use Scrapy's Rule and LinkExtractor in the first place.
This is my working code:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class ypSpider(CrawlSpider):
    name = "ypTest"
    allowed_domains = ["yellowpages.com"]
    start_urls = ['http://www.yellowpages.com/new-york-ny/restaurants']

    rules = (
        # Scrapes all the pagination links
        Rule(LinkExtractor(allow=[r'restaurants\?page=\d+']), follow=True),
        # Scrapes all the restaurant detail links and uses `parse_item` as a callback method
        Rule(LinkExtractor(restrict_xpaths="//div[@class='scrollable-pane']//a[@class='business-name']"),
             callback='parse_item'),
    )

    def parse_item(self, response):
        yield {
            'url': response.url
        }
So, I managed to understand how Rule and LinkExtractor work in this scenario.
The first Rule entry is for scraping all the pagination links. The allow parameter of LinkExtractor basically uses a regex to pass through only those links which match it; in this scenario, only links containing a pattern like restaurants\?page=\d+, where \d+ means one or more digits. Also, it uses the default parse method as the callback. (Here I could have used the restrict_xpaths parameter to choose only links from a specific region of the HTML instead of allow, but I used allow to understand how it works with a regex.)
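For illustration, the restrict_xpaths alternative for the pagination rule might look like the sketch below. The pagination container XPath is an assumption about the page layout (the actual class name may differ), not something taken from the working code above:

rules = (
    # Hypothetical: pick pagination links by page region instead of by regex.
    # '//div[@class="pagination"]' is an assumed container, adjust to the real markup.
    Rule(LinkExtractor(restrict_xpaths='//div[@class="pagination"]'), follow=True),
    Rule(LinkExtractor(restrict_xpaths="//div[@class='scrollable-pane']//a[@class='business-name']"),
         callback='parse_item'),
)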
The second Rule is for fetching all the restaurant detail links and parsing them with the parse_item method. In this Rule we use the restrict_xpaths parameter, which defines the regions inside the response from which links should be extracted. Here we take only content under the div with class scrollable-pane, and only links with class business-name, because if you inspect the HTML you'll find more than one link to the same restaurant (with different query parameters) inside the same div. And at the end, we pass our callback method parse_item.
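If you need more than the URL, parse_item can be extended with field extraction. A rough sketch, loosely based on the asker's selectors (the #mip wrapper and the field XPaths are assumptions about the detail-page markup and may need adjusting):

def parse_item(self, response):
    # Selectors adapted from the question's parse_page; made relative to the
    # <div id="mip"> wrapper, which is assumed to still exist on detail pages.
    for sel in response.xpath('//div[@id="mip"]'):
        yield {
            'url': response.url,
            'business': sel.xpath('.//h1/text()').extract(),
            'category': sel.xpath('.//dd[@class="categories"]/span/a/text()').extract(),
        }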
Now, when I run this spider, it fetches the details of all the restaurants (Restaurants in New York, NY), 3030 in total in this scenario.
Upvotes: 0
Reputation: 976
CrawlSpider uses the parse routine for its own purposes. Rename your parse() to something else, change the callback in rules[] to match, and try again.
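A minimal sketch of that change, applied to the first version of the asker's spider (the name parse_listing is just an example):

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ypSpider(CrawlSpider):
    name = "ypTest"
    allowed_domains = ["yellowpages.com"]
    start_urls = ["http://www.yellowpages.com/new-york-ny/restaurants?page=1"]

    rules = [
        # Point the rule at the renamed callback so CrawlSpider keeps its own parse()
        Rule(LinkExtractor(allow=[r'restaurants\?page=[1-2]']), callback='parse_listing', follow=True),
    ]

    def parse_listing(self, response):  # renamed from parse()
        for href in response.xpath('//div[@class="info"]/h3/a/@href'):
            url = response.urljoin(href.extract())
            if 'mip' in url:
                # parse_dir_contents stays as defined in the question
                yield scrapy.Request(url, callback=self.parse_dir_contents)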
Upvotes: 1