Reputation: 279
I use Scrapy to spider an entire website (allowed_domains = mydomain.com). Now I want to get all external links (links to other domains) from the crawled pages. How can I integrate this into my spider.py to get a list of all external URLs?
Upvotes: 2
Views: 2393
Reputation: 12814
Try using Link Extractors. Here is an example:
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.item import Item, Field

class MyItem(Item):
    url = Field()

class MySpider(CrawlSpider):
    name = 'my-domain.com'
    allowed_domains = ['my-domain.com']
    start_urls = ['http://www.my-domain.com']

    # Extract every link on each page and hand the response to parse_url;
    # follow=False means links found on those pages are not followed further.
    rules = (Rule(SgmlLinkExtractor(), callback='parse_url', follow=False),)

    def parse_url(self, response):
        item = MyItem()
        item['url'] = response.url
        return item
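One caveat: with allowed_domains set, Scrapy's offsite middleware drops requests to other domains before they are downloaded, so the callback above only ever sees internal pages. To collect the external URLs themselves without visiting them, you can run a second extractor restricted to foreign domains inside the callback. A minimal sketch using the same old-style API; the spider name, item name, and domain are placeholders:

from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.item import Item, Field

class ExternalUrlItem(Item):
    url = Field()

class ExternalLinksSpider(CrawlSpider):
    name = 'external-links'
    allowed_domains = ['my-domain.com']
    start_urls = ['http://www.my-domain.com']

    # Follow internal links so the whole site gets crawled; requests to
    # other domains would be filtered by the offsite middleware anyway.
    rules = (Rule(SgmlLinkExtractor(allow_domains=['my-domain.com']),
                  callback='parse_page', follow=True),)

    # A second extractor that keeps only links leaving my-domain.com.
    external_extractor = SgmlLinkExtractor(deny_domains=['my-domain.com'])

    def parse_page(self, response):
        # Record external links found on this internal page without
        # requesting them.
        for link in self.external_extractor.extract_links(response):
            item = ExternalUrlItem()
            item['url'] = link.url
            yield item

The same pattern works on newer Scrapy versions by importing LinkExtractor from scrapy.linkextractors and CrawlSpider/Rule from scrapy.spiders.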
Upvotes: 1