Reputation: 12409
I have a url of the form:
example.com/foo/bar/page_1.html
There are a total of 53 pages, each one of them has ~20 rows.
I basically want to get all the rows from all the pages, i.e. ~53*20 items.
I have working code in my parse method that parses a single page and also goes one page deeper per item, to get more info about that item:
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    restaurants = hxs.select('//*[@id="contenido-resbus"]/table/tr[position()>1]')
    for rest in restaurants:
        item = DegustaItem()
        item['name'] = rest.select('td[2]/a/b/text()').extract()[0]
        # some items don't have a category associated with them
        try:
            item['category'] = rest.select('td[3]/a/text()').extract()[0]
        except:
            item['category'] = ''
        item['urbanization'] = rest.select('td[4]/a/text()').extract()[0]

        # get profile url
        rel_url = rest.select('td[2]/a/@href').extract()[0]
        # join with base url since profile url is relative
        base_url = get_base_url(response)
        follow = urljoin_rfc(base_url, rel_url)

        request = Request(follow, callback=self.parse_profile)
        request.meta['item'] = item
        return request

def parse_profile(self, response):
    item = response.meta['item']
    # item['address'] = figure out xpath
    return item
The question is, how do I crawl each page?
example.com/foo/bar/page_1.html
example.com/foo/bar/page_2.html
example.com/foo/bar/page_3.html
...
...
...
example.com/foo/bar/page_53.html
Upvotes: 30
Views: 27884
Reputation: 8653
There are two common use cases for 'scrapy - parsing items that are paginated'.
A) We just want to move across the table and fetch data. This is relatively straightforward.
class TrainSpider(scrapy.Spider):
    name = "trip"
    start_urls = ['somewebsite']

    def parse(self, response):
        '''do something with this parser'''
        next_page = response.xpath("//a[@class='next_page']/@href").extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
Observe the last four lines: parse extracts the next-page link, joins it against the response URL, and yields a new Request with parse itself as the callback method.

B) Not only do we want to move across pages, we also want to extract data from one or more links on each page.
class StationDetailSpider(CrawlSpider):
    name = 'train'
    start_urls = ['someOtherWebsite']

    rules = (
        Rule(LinkExtractor(restrict_xpaths="//a[@class='next_page']"), follow=True),
        Rule(LinkExtractor(allow=r"/trains/\d+$"), callback='parse_trains')
    )

    def parse_trains(self, response):
        '''do your parsing here'''
Here, observe that:
We are using the CrawlSpider subclass of the scrapy.Spider parent class.
We have set two 'Rules':
a) The first rule just checks whether a 'next_page' link is available and follows it.
b) The second rule requests all the links on a page that match a format such as /trains/12343, and then calls parse_trains to do the parsing.
Important: we don't want to use the regular parse method here, because we are using the CrawlSpider subclass. CrawlSpider has its own parse method, so we must not override it; just remember to name your callback something other than parse.
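For completeness, a minimal sketch of what that callback could look like; the field names and XPaths below are made-up placeholders, not taken from any real page:

def parse_trains(self, response):
    # Hypothetical fields and XPaths, purely for illustration.
    yield {
        'number': response.xpath("//span[@class='train-number']/text()").extract_first(),
        'name': response.xpath("//h1/text()").extract_first(),
    }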
Upvotes: 10
Reputation: 1145
You could use a CrawlSpider instead of the BaseSpider and use an SgmlLinkExtractor to follow the links in the pagination.
For instance:
start_urls = ["www.example.com/page1"]

rules = (
    Rule(SgmlLinkExtractor(restrict_xpaths=('//a[@class="next_page"]',)), follow=True),
    Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@class="foto_imovel"]',)), callback='parse_call'),
)
The first rule tells Scrapy to follow the link matched by the XPath expression; the second rule tells Scrapy to call parse_call on the links matched by its XPath expression, in case you want to parse something on each of those pages.
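For reference, a complete spider built around those two rules might look roughly like this; the class name and start URL are placeholders, and the imports are the ones used by the old Scrapy releases this answer targets (newer versions use LinkExtractor instead of SgmlLinkExtractor):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class ExampleSpider(CrawlSpider):
    name = 'example'
    start_urls = ['http://www.example.com/page1']

    rules = (
        # Follow the pagination links without calling a callback.
        Rule(SgmlLinkExtractor(restrict_xpaths=('//a[@class="next_page"]',)), follow=True),
        # Send every detail link found on a listing page to parse_call.
        Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@class="foto_imovel"]',)), callback='parse_call'),
    )

    def parse_call(self, response):
        # Extract whatever you need from the detail page here.
        pass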
For more info please see the doc: http://doc.scrapy.org/en/latest/topics/spiders.html#crawlspider
Upvotes: 12
Reputation: 15712
You have two options to solve your problem. The general one is to use yield to generate new requests instead of return. That way you can issue more than one new request from a single callback. Check the second example at http://doc.scrapy.org/en/latest/topics/spiders.html#basespider-example.
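Applied to your spider, that means yielding every request from inside the loop instead of returning the first one; a sketch that keeps your existing selectors and only changes the control flow:

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    restaurants = hxs.select('//*[@id="contenido-resbus"]/table/tr[position()>1]')
    for rest in restaurants:
        item = DegustaItem()
        # ... fill in the item fields exactly as before ...
        rel_url = rest.select('td[2]/a/@href').extract()[0]
        follow = urljoin_rfc(get_base_url(response), rel_url)
        request = Request(follow, callback=self.parse_profile)
        request.meta['item'] = item
        yield request  # yield inside the loop: one request per restaurant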
In your case there is probably a simpler solution: just generate the list of start URLs from a pattern like this:
class MySpider(BaseSpider):
    start_urls = ['http://example.com/foo/bar/page_%s.html' % page for page in xrange(1, 54)]
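Side note: in current Scrapy releases and Python 3 the same idea would use scrapy.Spider and range, since BaseSpider and xrange are no longer available:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'  # spider name made up for the example
    start_urls = ['http://example.com/foo/bar/page_%s.html' % page for page in range(1, 54)]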
Upvotes: 49