mightycommander

Reputation: 61

Scraping recursive page data with Scrapy

I am trying to scrape the store location data from the Subway UK Restaurant Finder using Python and Scrapy. I have managed to scrape individual pages, but I would like to set it up to run through a list of, say, 1000 store IDs appended to the end of the link. Any help would be appreciated.

Disclaimer: I don't know what I'm doing

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from subway.items import SubwayFinder

class MySpider(BaseSpider):
    name = "subway"
    # allowed_domains takes bare domain names, not full URLs
    allowed_domains = ["subway.co.uk"]
    start_urls = ["http://www.subway.co.uk/business/storefinder/store-detail.aspx?id=453056039"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select("//div[@class='mid']")
        items = []
        for title in titles:  # don't shadow the list with the loop variable
            item = SubwayFinder()
            item["title"] = title.select("p/span/text()").extract()
            items.append(item)
        return items

Upvotes: 0

Views: 1088

Answers (2)

Abhishek

Reputation: 3066

Instead of BaseSpider, you can use CrawlSpider.

Check out the Scrapy documentation for how to use CrawlSpider.

You will need to define rules so that Scrapy knows how to crawl through the webpages. These rules define which sites and links you want Scrapy to follow and scrape.

A sample crawl spider showing the structure is sketched below, after the warning.

By the way, consider changing the function name; from the docs:

Warning

When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.
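Putting that together, here is a minimal sketch of such a CrawlSpider, using the same old-style Scrapy API as the question. The start URL and the link-extractor pattern are assumptions about the site, not tested against it; note the callback is deliberately not named parse:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from subway.items import SubwayFinder

class SubwayCrawlSpider(CrawlSpider):
    name = "subway_crawl"
    allowed_domains = ["subway.co.uk"]
    # assumed entry point; any page that links to store-detail pages works
    start_urls = ["http://www.subway.co.uk/business/storefinder/search.aspx?pc=1"]

    # follow every link that looks like a store-detail page; the callback
    # is NOT called "parse", per the warning above
    rules = (
        Rule(SgmlLinkExtractor(allow=(r"store-detail\.aspx\?id=\d+",)),
             callback="parse_store"),
    )

    def parse_store(self, response):
        hxs = HtmlXPathSelector(response)
        item = SubwayFinder()
        item["title"] = hxs.select("//div[@class='mid']/p/span/text()").extract()
        return item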

Upvotes: 1

Guy Gavriely

Reputation: 11396

As shown in your code, a spider function can return (or yield) items, but it can also return/yield Requests. Scrapy will send the items to the configured pipelines and schedule those requests for further scraping. Take a look at the Request fields; the callback function is the one that will be called with the response.
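For illustration, a minimal sketch of a callback that yields both items and follow-up Requests; the start URL, XPaths, and link pattern are assumptions about the page, not tested against it:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from urlparse import urljoin
from subway.items import SubwayFinder

class StoreListSpider(BaseSpider):
    # hypothetical spider, for illustration only
    name = "store_list"
    allowed_domains = ["subway.co.uk"]
    start_urls = ["http://www.subway.co.uk/business/storefinder/search.aspx?pc=1"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # yielded items are sent to the configured item pipelines
        for div in hxs.select("//div[@class='mid']"):
            item = SubwayFinder()
            item["title"] = div.select("p/span/text()").extract()
            yield item
        # yielded Requests are scheduled for further scraping; Scrapy
        # calls the given callback with each response
        for href in hxs.select("//a[contains(@href, 'store-detail')]/@href").extract():
            yield Request(urljoin(response.url, href), callback=self.parse)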

In order to scrape multiple store locations, you'll have to look for a URL pattern or an index page that has links to all the stores.

For example:

http://www.subway.co.uk/business/storefinder/store-detail.aspx?id=453056039

does not look like a good candidate for looping over all store IDs; issuing 453,056,039 HTTP requests is not a good idea.

I couldn't find an index page on the site. The closest thing would be to set start_urls to 'www.subway.co.uk/business/storefinder/search.aspx?pc=' followed by each number in range(1, 10), or whatever range proves better, and then crawl the links that appear on each page. Also note that, luckily, Scrapy will not scrape a page twice (unless told to), so a store-detail page that appears in more than one index page is not a problem.
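Note that a plain string + range(1, 10) won't concatenate in Python; a list comprehension does the job. A minimal sketch, assuming the pc values are small integers (an untested guess):

start_urls = ["http://www.subway.co.uk/business/storefinder/search.aspx?pc=%d" % i
              for i in range(1, 10)]

Scrapy's built-in duplicate filter then takes care of store pages linked from more than one search page.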

Upvotes: 1
