Reputation: 27
As a newbie in scrapy and python, I'm struggling with the deny rules of my Crawl Spider. I want to filter out all URLs on my target page which contain the word "versicherung" or the double ? structure anywhere in the URL. However, scrapy ignores my rules. Can anyone tell me what's wrong with the syntax? (I've already tried it without the "\" before the *, but that doesn't work either.)
Rule:
rules = [Rule(LinkExtractor(deny=(r'\*versicher\*', r'\*\?\*\?\*')),
              callback='parse_norisbank', follow=True)]
Log:
DEBUG: Crawled (200) <GET https://www.norisbank.de/produkte/versicherungen/rechtsschutzversicherung.html> (referer: https://www.norisbank.de)
DEBUG: Crawled (200) <GET https://www.norisbank.de/produkte/versicherungen/haftpflichtversicherung.html> (referer: https://www.norisbank.de)
DEBUG: Crawled (200) <GET https://www.norisbank.de/produkte/versicherungen/hausratversicherung.html> (referer: https://www.norisbank.de)
DEBUG: Crawled (200) <GET https://www.norisbank.de/produkte/versicherungen/versicherungsmanager.html> (referer: https://www.norisbank.de)
DEBUG: Saved file nbtest-versicherungen.html
Upvotes: 1
Views: 343
Reputation: 89172
The deny rules must be regular expressions, and (even if I correct your syntax) you are not using * correctly: r'\*versicher\*' should be r'.*versicher.*'.
EDIT: looking at the scrapy docs, it looks like r'versicher' is sufficient.
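You can check this yourself with the standard re module. This is a minimal sketch (scrapy not required), assuming that LinkExtractor's deny patterns are applied to the URL with an unanchored search, which is why the leading and trailing .* are unnecessary:

```python
import re

# Two of the URLs from the question's crawl log.
urls = [
    "https://www.norisbank.de/produkte/versicherungen/rechtsschutzversicherung.html",
    "https://www.norisbank.de/produkte/versicherungen/haftpflichtversicherung.html",
]

# The broken pattern: escaping '*' makes it a literal asterisk,
# so it only matches URLs containing the text '*versicher*'.
broken = re.compile(r'\*versicher\*')

# The fix: a plain substring pattern matches anywhere in the URL.
fixed = re.compile(r'versicher')

for url in urls:
    assert broken.search(url) is None       # no literal asterisks in the URL
    assert fixed.search(url) is not None    # 'versicher' occurs as a substring

print("r'versicher' matches every crawled URL; the escaped pattern matches none")
```

Since the escaped pattern never matches, scrapy isn't "ignoring" the rule at all: the rule simply denies nothing.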
I don't understand what you mean by the "double ? structure", but your URLs don't seem to have it. I expect r'.*\?\?.*' is what you want (or simply r'\?\?').
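Note that r'\?\?' only matches two adjacent question marks. If "double ? structure" means two ? anywhere in the URL (adjacent or not), you need something in between. A small sketch, using a made-up URL since none of the logged URLs contains a ? at all:

```python
import re

# Hypothetical URL with two non-adjacent '?' characters (not from the log).
url = "https://www.norisbank.de/suche?page=1?lang=de"

adjacent = re.compile(r'\?\?')      # two consecutive literal '?'
anywhere = re.compile(r'\?.*\?')    # two '?' with anything in between

print(bool(adjacent.search(url)))   # the two '?' are not adjacent here
print(bool(anywhere.search(url)))   # but this pattern finds both of them
```

So pick the pattern that matches the structure you actually want to deny.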
In regular expressions:
- . means any character
- * means 0 or more of the preceding (so .* matches anything)
- \ is how you escape a special character. You don't want to escape the *, since you want it to act in its special way.
Upvotes: 2