Reputation: 141
I am trying to make my scrapy spider deny .com domains. What is the correct string to pass to deny_domains? I have tried "*.com" but it does not work.
Question UPDATE: How can I do it the other way around? For example, if I only want to scrape .com domains.
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from myproject.items import MyprojectItem

class pformSpider(CrawlSpider):
    name = "pform6"

    start_urls = [
        "http://example.se",
    ]

    extractor = SgmlLinkExtractor(deny_domains=("*.com"))

    rules = (
        Rule(extractor, callback='parse_links', follow=True),
    )

    def parse_links(self, response):
        item = MyprojectItem()
        item['url'] = response.url
        yield item
Upvotes: 0
Views: 1546
Reputation: 1
Based on the documentation I'd say you need to do something like this:
extractor = SgmlLinkExtractor(allow="*.com")
Note: I didn't test this.
Parameters: allow (str or list) – a single regular expression (or list of regular expressions) that the (absolute) urls must match in order to be extracted. If not given (or empty), it will match all links.
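Note that since allow is treated as a regular expression (per the docs quoted above), a glob-style "*.com" would fail to compile rather than match anything; a regex form such as ".*\.com" would be needed. A quick check of the pattern itself with Python's re module (this only tests the regex, not Scrapy's extractor):

```python
import re

# Glob-style "*.com" is not a valid regex: the leading "*" has
# nothing to repeat, so compiling it raises re.error.
try:
    re.compile("*.com")
    glob_compiles = True
except re.error:
    glob_compiles = False

print(glob_compiles)  # False: re rejects the glob

# A regex equivalent escapes the dot and uses ".*" for the wildcard:
pattern = re.compile(r".*\.com")
print(bool(pattern.search("http://example.com/page")))  # True
```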
Upvotes: 0
Reputation: 9430
You can use scrapy.linkextractors.
From http://doc.scrapy.org/en/latest/topics/link-extractors.html
deny_domains (str or list) – a single value or a list of strings containing domains which won’t be considered for extracting the links
But
deny (a regular expression (or list of)) – a single regular expression (or list of regular expressions) that the (absolute) urls must match in order to be excluded (ie. not extracted).
So you can pass a regex to deny, for example:
".*\.com\/.*"
But note that it may also match ".com" appearing elsewhere in the URL, not just in the domain.
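To illustrate that caveat with Python's re module (which is what regex patterns like this compile under), here is a quick check; the tighter, host-anchored pattern at the end is just one possible alternative, not tested against Scrapy itself:

```python
import re

deny = re.compile(r".*\.com/.*")

# Matches a .com domain, as intended:
print(bool(deny.search("http://example.com/page")))           # True

# But also matches ".com" in the path of a non-.com site:
print(bool(deny.search("http://example.se/acme.com/page")))   # True

# Anchoring on the host part avoids that false positive by
# requiring ".com" at the end of the domain:
tighter = re.compile(r"^https?://[^/]+\.com(/|$)")
print(bool(tighter.search("http://example.se/acme.com/page"))) # False
print(bool(tighter.search("http://example.com/page")))         # True
```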
Upvotes: 3
Reputation: 18799
from scrapy.linkextractors import LinkExtractor
...
rules = (
    Rule(LinkExtractor(deny=('.+\.com', ))),
)
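For the question update (scraping only .com domains), the same regex can be passed to allow instead of deny. The way the two parameters interact is roughly the following (a simplified sketch of allow/deny filtering, not Scrapy's actual implementation):

```python
import re

def keep_url(url, allow=None, deny=None):
    """Simplified allow/deny filter: keep a URL only if it matches
    the allow pattern (when given) and does not match deny."""
    if allow is not None and not re.search(allow, url):
        return False
    if deny is not None and re.search(deny, url):
        return False
    return True

# Deny .com domains (the original question):
print(keep_url("http://example.com/a", deny=r".+\.com"))   # False
print(keep_url("http://example.se/a", deny=r".+\.com"))    # True

# Allow only .com domains (the update):
print(keep_url("http://example.com/a", allow=r".+\.com"))  # True
print(keep_url("http://example.se/a", allow=r".+\.com"))   # False
```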
Upvotes: 3