codeer

Reputation: 141

How to make a Scrapy spider deny country domains

I am trying to make my Scrapy spider deny .com domains. What is the correct string to pass to deny_domains? I have tried "*.com", but it does not work.

Question UPDATE: How can I do it the other way around? For example, if I only want to scrape .com domains.

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from myproject.items import MyprojectItem

class pformSpider(CrawlSpider):
    name = "pform6"

    start_urls = [
        "http://example.se",
    ]

    extractor = SgmlLinkExtractor(deny_domains=("*.com"))

    rules = (
        Rule(extractor, callback='parse_links', follow=True),
    )

    def parse_links(self, response):
        item = MyprojectItem()
        item['url'] = response.url
        yield item

Upvotes: 0

Views: 1546

Answers (3)

Casper

Reputation: 1

Based on the documentation, I'd say you need to pass a regular expression (not a glob) to allow, something like this:

extractor = SgmlLinkExtractor(allow=r"\.com(/|$)")

Note: I didn't test this.

Parameters: allow (str or list) – a single regular expression (or list of regular expressions) that the (absolute) urls must match in order to be extracted. If not given (or empty), it will match all links.
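
Since allow expects regular expressions rather than shell-style globs, it may help to check that a pattern compiles before handing it to the link extractor. The sketch below uses only the standard re module; is_valid_regex is a hypothetical helper name, not part of Scrapy.

```python
import re


def is_valid_regex(pattern):
    """Return True if `pattern` compiles as a Python regular expression."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True


# Glob syntax: the leading "*" has nothing to repeat, so this does not compile.
print(is_valid_regex("*.com"))

# Regex syntax: an escaped dot followed by "com" compiles fine.
print(is_valid_regex(r"\.com"))
```

A pattern that compiles is not necessarily the right one, of course; it still has to match the URLs you actually want.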

Upvotes: 0

Dan-Dev

Reputation: 9430

You can use scrapy.linkextractors.

From http://doc.scrapy.org/en/latest/topics/link-extractors.html:

deny_domains (str or list) – a single value or a list of strings containing domains which won’t be considered for extracting the links

But there is also:

deny (a regular expression (or list of)) – a single regular expression (or list of regular expressions) that the (absolute) urls must match in order to be excluded (ie. not extracted).

So you can use a regex with deny instead. I guess something like:

".*\.com\/.*"

But it may also match ".com/" elsewhere in the URL, for example in the path.
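
To illustrate that caveat, here is a small sketch comparing the pattern above with a stricter one that only looks at the host part of the URL. COM_HOST is my own assumption, not something taken from the Scrapy docs, and the sketch assumes the pattern is applied as a regex search against the absolute URL. It has only been checked against the sample URLs shown.

```python
import re

# Loose pattern from the answer: matches ".com/" anywhere in the URL.
LOOSE = re.compile(r".*\.com\/.*")

# Hypothetical stricter pattern: ".com" must end the host, which may be
# followed by an optional port and then a path, query, fragment, or nothing.
COM_HOST = re.compile(r"^https?://[^/?#]+\.com(:\d+)?([/?#]|$)")

urls = [
    "http://example.com/page",           # .com host with a path
    "http://example.com",                # .com host, no trailing slash
    "http://example.se/about.com/team",  # ".com/" appears only in the path
    "http://example.se/",                # no ".com" at all
]

for url in urls:
    # Print whether each pattern matches the URL (loose first, then strict).
    print(url, bool(LOOSE.search(url)), bool(COM_HOST.search(url)))
```

The loose pattern misses the bare "http://example.com" and wrongly catches the .se URL whose path contains ".com/", while the host-anchored one handles both. The same pattern string could be tried with allow for the reverse case.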

Upvotes: 3

eLRuLL

Reputation: 18799

from scrapy.linkextractors import LinkExtractor
...
    rules = (
        # Raw string so that "\." reaches the regex engine as an escaped dot.
        Rule(LinkExtractor(deny=(r'.+\.com', ))),
    )
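
If the regex route feels fragile, another option is to parse each URL and test the hostname suffix directly. This is only a sketch using the standard library; host_ends_with, keep_only and drop_matching are hypothetical names, and wiring them into a spider (for instance through a Rule's process_links callback) is not shown or tested here.

```python
from urllib.parse import urlparse


def host_ends_with(url, suffix=".com"):
    """Return True if the URL's hostname ends with `suffix`."""
    host = urlparse(url).hostname or ""
    return host.lower().endswith(suffix)


def keep_only(urls, suffix=".com"):
    """Keep only URLs whose host ends with `suffix` (the UPDATE case)."""
    return [u for u in urls if host_ends_with(u, suffix)]


def drop_matching(urls, suffix=".com"):
    """Drop URLs whose host ends with `suffix` (the original question)."""
    return [u for u in urls if not host_ends_with(u, suffix)]


urls = [
    "http://example.com/page",
    "http://example.se/about.com/team",
    "http://example.se/",
]

print(keep_only(urls))
print(drop_matching(urls))
```

Because the check looks only at the parsed hostname, a ".com" that appears in the path or query string does not affect the result.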

Upvotes: 3
