Reputation: 315
I wonder if there is a way to get all the URLs in an entire website. It seems that Scrapy with CrawlSpider and LinkExtractor is a good choice. Consider this example:
from scrapy.item import Field, Item
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor


class SampleItem(Item):
    link = Field()


class SampleSpider(CrawlSpider):
    name = "sample_spider"
    allowed_domains = ["domain.com"]
    start_urls = ["http://domain.com"]

    rules = (
        Rule(LinkExtractor(), callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        item = SampleItem()
        item['link'] = response.url
        return item
This spider does not give me what I want. It only gives me the links on a single webpage, namely the start URL. But what I want is every link on the website, including those that are not on the start URL. Did I understand the example correctly? Is there a solution to my problem? Thanks a lot!
Upvotes: 2
Views: 1860
Reputation: 4501
Export each item via a Feed Export. This will result in a list of all links found on the site.
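For instance, assuming the spider above is named sample_spider, the built-in feed exports can be driven straight from the command line:

    scrapy crawl sample_spider -o links.csv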
Or, write your own Item Pipeline to export all of your links to a file, database, or whatever you choose.
Another option would be to create a spider level list to which you append each URL, instead of using items at all. How you proceed will really depend on what you need from the spider, and how you intend to use it.
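As a rough sketch of the pipeline route (the class name and output file below are just placeholders, and the pipeline would still need to be enabled in your ITEM_PIPELINES setting):

    # pipelines.py -- illustrative sketch: append each scraped link to a text file
    class LinkExportPipeline(object):

        def open_spider(self, spider):
            self.file = open('links.txt', 'w')

        def process_item(self, item, spider):
            self.file.write(item['link'] + '\n')
            return item

        def close_spider(self, spider):
            self.file.close()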
Upvotes: 2
Reputation: 2636
You could create a spider that gathers all the links on a page, then, for each of those links, checks the domain: if it is the same, parse those links too, rinse, repeat.
There's no guarantee, however, that you'll catch all pages of the domain in question; see How to get all webpages on a domain for a good overview of the issue, in my opinion.
import scrapy
from urllib.parse import urlsplit


class SampleSpider(scrapy.Spider):
    name = "sample_spider"
    allowed_domains = ["domain.com"]
    start_urls = ["http://domain.com"]

    def parse(self, response):
        urls = response.xpath('//a/@href').extract()
        for u in urls:
            # resolve relative links against the current page
            u = response.urljoin(u)
            # only follow links that stay on the same domain
            if urlsplit(u).netloc == urlsplit(response.url).netloc:
                yield scrapy.Request(u, self.parse)
Upvotes: 1