Reputation: 315
I wonder if there is a way to get all the URLs in an entire website. It seems that Scrapy with CrawlSpider and LinkExtractor is a good choice. Consider this example:
from scrapy.item import Field, Item
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor


class SampleItem(Item):
    link = Field()


class SampleSpider(CrawlSpider):
    name = "sample_spider"
    allowed_domains = ["domain.com"]
    start_urls = ["http://domain.com"]

    rules = (
        Rule(LinkExtractor(), callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        item = SampleItem()
        item['link'] = response.url
        return item
This spider does not give me what I want. It only gives me the links on a single webpage, namely the start URL. But what I want is every link on the website, including those that are not on the start URL. Did I understand the example correctly? Is there a solution to my problem? Thanks a lot!
Upvotes: 2
Views: 1860
Reputation: 4501
Export each item via a Feed Export. This will result in a list of all links found on the site.
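For instance, assuming the spider above is named sample_spider, the built-in feed exports can be driven straight from the command line:

    scrapy crawl sample_spider -o links.csv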
Or, write your own Item Pipeline to export all of your links to a file, database, or whatever you choose.
Another option would be to create a spider level list to which you append each URL, instead of using items at all. How you proceed will really depend on what you need from the spider, and how you intend to use it.
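As a rough sketch of the pipeline route (the class name and output file below are just placeholders, and the pipeline would still need to be enabled in your ITEM_PIPELINES setting):

    # pipelines.py -- illustrative sketch: append each scraped link to a text file
    class LinkExportPipeline(object):

        def open_spider(self, spider):
            self.file = open('links.txt', 'w')

        def process_item(self, item, spider):
            self.file.write(item['link'] + '\n')
            return item

        def close_spider(self, spider):
            self.file.close()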
Upvotes: 2
Reputation: 2636
You could create a spider that gathers all the links on a page, then, for each of those links, checks the domain: if it is the same, parse those links too, rinse, repeat.
There's no guarantee, however, that you'll catch all pages of the domain in question; see How to get all webpages on a domain for a good overview of the issue, in my opinion.
import scrapy
from urllib.parse import urlsplit


class SampleSpider(scrapy.Spider):
    name = "sample_spider"
    allowed_domains = ["domain.com"]
    start_urls = ["http://domain.com"]

    def parse(self, response):
        urls = response.xpath('//a/@href').extract()
        for u in urls:
            # resolve relative links against the current page
            u = response.urljoin(u)
            # only follow links that stay on the same domain
            if urlsplit(u).netloc == urlsplit(response.url).netloc:
                yield scrapy.Request(u, self.parse)
Upvotes: 1