Reputation: 61
I am trying to scrape the store location data from the Subway UK Restaurant Finder using Python and Scrapy. I have managed to scrape individual pages, but I would like to set the spider up to run through a list of, say, 1000 sequential IDs at the end of the link. Any help would be appreciated.
Disclaimer: I don't know what I'm doing
    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    from subway.items import SubwayFinder

    class MySpider(BaseSpider):
        name = "subway"
        # allowed_domains expects bare domain names, not full URLs
        allowed_domains = ["subway.co.uk"]
        start_urls = ["http://www.subway.co.uk/business/storefinder/store-detail.aspx?id=453056039"]

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            titles = hxs.select("//div[@class='mid']")
            items = []
            # use a distinct loop variable so it does not shadow the list
            for title in titles:
                item = SubwayFinder()
                item["title"] = title.select("p/span/text()").extract()
                items.append(item)
            return items
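For reference, generating the start URLs over a range of IDs only needs a list comprehension; the range below is a placeholder, since I don't know which IDs are actually valid:

    start_urls = ["http://www.subway.co.uk/business/storefinder/store-detail.aspx?id=%d" % i
                  for i in range(453056039, 453056039 + 1000)]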
Upvotes: 0
Views: 1088
Reputation: 3066
Instead of BaseSpider, you can use CrawlSpider. Check out this link for how to use crawl spiders.
You will need to define rules so that Scrapy knows which sites and links it is allowed to follow while crawling. You can check this example of a sample crawl spider for the structure; a sketch also follows after the warning below.
By the way, consider changing the function name; from the docs:
Warning
When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.
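Here is a minimal sketch of what that could look like, using the same old-style Scrapy API as the question; the start URL, the link patterns, and the parse_store callback are my assumptions, not tested against the site:

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import HtmlXPathSelector
    from subway.items import SubwayFinder

    class SubwayCrawlSpider(CrawlSpider):
        name = "subway_crawl"
        allowed_domains = ["subway.co.uk"]
        # assumed entry point; any page that links to the stores works
        start_urls = ["http://www.subway.co.uk/business/storefinder/search.aspx"]

        rules = (
            # follow search/index pages (no callback, so follow defaults to True)
            Rule(SgmlLinkExtractor(allow=(r"search\.aspx",))),
            # hand every store-detail page to parse_store; the callback is
            # deliberately NOT named "parse" (see the warning above)
            Rule(SgmlLinkExtractor(allow=(r"store-detail\.aspx\?id=\d+",)),
                 callback="parse_store"),
        )

        def parse_store(self, response):
            hxs = HtmlXPathSelector(response)
            item = SubwayFinder()
            item["title"] = hxs.select("//div[@class='mid']/p/span/text()").extract()
            return item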
Upvotes: 1
Reputation: 11396
As shown in your code, a spider function can return (or yield) items, but it can also return/yield Requests. Scrapy will send the items to the configured pipelines and issue those requests for further scraping. Take a look at the Request fields: the callback function is the one that will be called with the response.
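A sketch of that flow, reusing the XPaths from the question (the link pattern and the parse_store name are my own guesses):

    import urlparse
    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    from scrapy.http import Request
    from subway.items import SubwayFinder

    class StoreSpider(BaseSpider):
        name = "stores"
        allowed_domains = ["subway.co.uk"]
        start_urls = ["http://www.subway.co.uk/business/storefinder/search.aspx"]

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            # yield a Request per store link; Scrapy fetches each URL
            # and calls self.parse_store with the response
            for href in hxs.select("//a[contains(@href, 'store-detail')]/@href").extract():
                yield Request(urlparse.urljoin(response.url, href),
                              callback=self.parse_store)

        def parse_store(self, response):
            # yield items; Scrapy sends them to the configured pipelines
            hxs = HtmlXPathSelector(response)
            for title in hxs.select("//div[@class='mid']"):
                item = SubwayFinder()
                item["title"] = title.select("p/span/text()").extract()
                yield item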
In order to scrape multiple store locations, you'll have to look for a URL pattern or an index page that has links to all the stores.
For example:
http://www.subway.co.uk/business/storefinder/store-detail.aspx?id=453056039
does not look like a good candidate for looping over all store IDs: firing 453,056,039 HTTP requests is not a good idea.
I couldn't find an index page on the site. The closest thing might be to set start_urls to the search pages, e.g. 'http://www.subway.co.uk/business/storefinder/search.aspx?pc=%d' % i for i in range(1, 10), or some other range that proves better, and to crawl further along the links that appear on each page. Also note that, luckily, Scrapy will not scrape a page twice (unless told to), so a store-detail page that appears on more than one index page is not a problem. A sketch follows below.
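Building on the spider sketch above, the seed pages could be generated like this; whether pc=1..9 actually returns useful result pages is exactly the part to verify:

    start_urls = ["http://www.subway.co.uk/business/storefinder/search.aspx?pc=%d" % i
                  for i in range(1, 10)]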
Upvotes: 1