Armin Abele

Reputation: 27

Scrapy ignores deny rule

As a newbie in scrapy and python, I'm struggling with the deny rules of my Crawl Spider. I want to filter all URLs on my target page, which contain the word "versicherung" and the double ? structure in any part of the URL. However, scrapy ignores my rule. Can anyone tell me what's wrong with the syntax (I've already tried without the "" before the *, but that doesn't work either)?

Rule:

rules = [Rule(LinkExtractor(deny=r'\*versicher\*', r'\*\?\*\?\*'),
            callback='parse_norisbank', follow=True)]

Log:

DEBUG: Crawled (200) <GET https://www.norisbank.de/produkte/versicherungen/rechtsschutzversicherung.html> (referer: https://www.norisbank.de)
DEBUG: Crawled (200) <GET https://www.norisbank.de/produkte/versicherungen/haftpflichtversicherung.html> (referer: https://www.norisbank.de)
DEBUG: Crawled (200) <GET https://www.norisbank.de/produkte/versicherungen/hausratversicherung.html> (referer: https://www.norisbank.de)
DEBUG: Crawled (200) <GET https://www.norisbank.de/produkte/versicherungen/versicherungsmanager.html> (referer: https://www.norisbank.de)
DEBUG: Saved file nbtest-versicherungen.html

Upvotes: 1

Views: 343

Answers (1)

Lou Franco

Reputation: 89172

The deny patterns must be regular expressions, and (even after correcting your syntax) you are not using * correctly.

r'\*versicher\*' should be r'.*versicher.*' EDIT: looking at scrapy docs, it looks like r'versicher' is sufficient.

I don't understand what you mean by "double ? structure", but your URLs don't seem to have it.

I expect r'.*\?\?.*' is what you want (or simply r'\?\?').
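The difference between patterns can be checked directly with Python's re module. Note one subtlety: r'\?\?' only matches two question marks that sit next to each other; if "double ? structure" means two question marks anywhere in the URL, r'\?.*\?' would be needed instead. The example URLs below are made up, since the question's log doesn't show any URL containing a ?:

```python
import re

# Hypothetical URLs for illustration (not from the question's log)
adjacent = 'https://www.norisbank.de/suche??'
separated = 'https://www.norisbank.de/suche?q=konto?page=2'

# r'\?\?' matches only two question marks right next to each other
assert re.search(r'\?\?', adjacent) is not None
assert re.search(r'\?\?', separated) is None

# r'\?.*\?' matches two question marks anywhere in the URL
assert re.search(r'\?.*\?', separated) is not None
```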

In regular expressions

  • . means any character
  • * means 0 or more of the preceding (so .* matches anything)
  • \ is how you escape a special character. You don't want to escape the *, since you want it to act in its special way.
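The points above can be demonstrated with re.search against one of the URLs from the question's log (the second, non-matching URL is made up for contrast):

```python
import re

urls = [
    'https://www.norisbank.de/produkte/versicherungen/hausratversicherung.html',
    'https://www.norisbank.de/produkte/girokonto.html',  # assumed non-matching URL
]

# Escaped form: \* means a literal '*', so this looks for the exact
# text '*versicher*', which no URL contains
assert re.search(r'\*versicher\*', urls[0]) is None

# Unescaped form: 'versicher' matches that substring anywhere in the URL
assert re.search(r'versicher', urls[0]) is not None
assert re.search(r'versicher', urls[1]) is None
```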

Upvotes: 2
