Armin Abele

Reputation: 27

Scrapy ignores deny rule

As a newbie in scrapy and python, I'm struggling with the deny rules of my Crawl Spider. I want to filter all URLs on my target page, which contain the word "versicherung" and the double ? structure in any part of the URL. However, scrapy ignores my rule. Can anyone tell me what's wrong with the syntax (I've already tried without the "" before the *, but that doesn't work either)?

Rule:

rules = [Rule(LinkExtractor(deny=r'\*versicher\*', r'\*\?\*\?\*'),
            callback='parse_norisbank', follow=True)]

Log:

DEBUG: Crawled (200) <GET https://www.norisbank.de/produkte/versicherungen/rechtsschutzversicherung.html> (referer: https://www.norisbank.de)
DEBUG: Crawled (200) <GET https://www.norisbank.de/produkte/versicherungen/haftpflichtversicherung.html> (referer: https://www.norisbank.de)
DEBUG: Crawled (200) <GET https://www.norisbank.de/produkte/versicherungen/hausratversicherung.html> (referer: https://www.norisbank.de)
DEBUG: Crawled (200) <GET https://www.norisbank.de/produkte/versicherungen/versicherungsmanager.html> (referer: https://www.norisbank.de)
DEBUG: Saved file nbtest-versicherungen.html

Upvotes: 1

Views: 343

Answers (1)

Lou Franco

Reputation: 89172

The deny patterns must be regular expressions, and (even after correcting your syntax) you are not using * correctly.

r'\*versicher\*' should be r'.*versicher.*' EDIT: looking at scrapy docs, it looks like r'versicher' is sufficient.

I don't understand what you mean by "double ? structure", but your URLs don't seem to have it.

I expect r'.*\?\?.*' is what you want (or simply r'\?\?').
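The difference between patterns can be checked directly with Python's re module. Note one subtlety: r'\?\?' only matches two question marks that sit next to each other; if "double ? structure" means two question marks anywhere in the URL, r'\?.*\?' would be needed instead. The example URLs below are made up, since the question's log doesn't show any URL containing a ?:

```python
import re

# Hypothetical URLs for illustration (not from the question's log)
adjacent = 'https://www.norisbank.de/suche??'
separated = 'https://www.norisbank.de/suche?q=konto?page=2'

# r'\?\?' matches only two question marks right next to each other
assert re.search(r'\?\?', adjacent) is not None
assert re.search(r'\?\?', separated) is None

# r'\?.*\?' matches two question marks anywhere in the URL
assert re.search(r'\?.*\?', separated) is not None
```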

In regular expressions

  • . means any character
  • * means 0 or more of the preceding (so .* matches anything)
  • \ is how you escape a special character. You don't want to escape the *, since you want it to act in its special way.
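The points above can be demonstrated with re.search against one of the URLs from the question's log (the second, non-matching URL is made up for contrast):

```python
import re

urls = [
    'https://www.norisbank.de/produkte/versicherungen/hausratversicherung.html',
    'https://www.norisbank.de/produkte/girokonto.html',  # assumed non-matching URL
]

# Escaped form: \* means a literal '*', so this looks for the exact
# text '*versicher*', which no URL contains
assert re.search(r'\*versicher\*', urls[0]) is None

# Unescaped form: 'versicher' matches that substring anywhere in the URL
assert re.search(r'versicher', urls[0]) is not None
assert re.search(r'versicher', urls[1]) is None
```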

Upvotes: 2
