Reputation: 2100
I want to crawl the pages related to Disney on the Bloomberg website. The URLs follow this pattern:
"http://bloomberg.com/news/2013-07-08/disney-welcometohomepageofdisney"
So I have written the rule below for it:
rules = [
Rule(SgmlLinkExtractor(allow=('/news/*/disney*',)), follow=True),
]
But the above rule doesn't work as I want, and the crawled pages in my output are not related to Disney. Please help me fix this rule.
Upvotes: 1
Views: 2930
Reputation: 298246
/news/* matches /news followed by any number of / characters.
The correct regex would be:
/news/.*/disney
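A quick way to see the difference is to try both patterns with Python's re module. This is only a minimal sketch: it assumes the link extractor applies the pattern as a regex search against the URL, and other_url and the matches helper are made up for illustration.

```python
import re

# URL taken from the question.
disney_url = "http://bloomberg.com/news/2013-07-08/disney-welcometohomepageofdisney"
# Hypothetical non-Disney URL, used only for comparison.
other_url = "http://bloomberg.com/news/2013-07-08/apple-earnings"

# In a regex, "/*" means "zero or more slashes", not "any path segment".
original = re.compile(r"/news/*/disney*")
# ".*" means "any run of characters", which is what the date segment needs.
corrected = re.compile(r"/news/.*/disney")


def matches(pattern, url):
    """Return True if the pattern is found anywhere in the URL."""
    return pattern.search(url) is not None


print(matches(original, disney_url))   # the original rule misses the Disney page
print(matches(corrected, disney_url))  # the corrected rule finds it
print(matches(corrected, other_url))   # and still skips the unrelated page
```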
Upvotes: 3
Reputation: 18803
You likely need the following regex:
/news/[^/]+/disney.*
which, with the slashes escaped, looks like
\/news\/[^\/]+\/disney.*
This way the middle part matches everything up to the next /, but not arbitrary text beyond it.
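The effect of [^/]+ can be checked with Python's re module. This is a rough sketch, assuming the pattern is applied as a search against the URL; nested_url is a hypothetical URL with an extra path segment, invented for the comparison.

```python
import re

# URL taken from the question.
disney_url = "http://bloomberg.com/news/2013-07-08/disney-welcometohomepageofdisney"
# Hypothetical URL with an extra segment between the date and "disney".
nested_url = "http://bloomberg.com/news/2013-07-08/markets/disney-stock"

# "[^/]+" matches exactly one path segment: one or more non-slash characters.
one_segment = re.compile(r"/news/[^/]+/disney.*")
# Escaped form from the answer; in Python, "\/" simply matches a literal "/".
one_segment_escaped = re.compile(r"\/news\/[^\/]+\/disney.*")
# ".*" can span several segments, slashes included.
any_text = re.compile(r"/news/.*/disney")

for url in (disney_url, nested_url):
    print(
        url,
        one_segment.search(url) is not None,
        one_segment_escaped.search(url) is not None,
        any_text.search(url) is not None,
    )
```

Both forms of the [^/]+ pattern accept only URLs where disney comes directly after a single segment following /news/, while the .* version also accepts the nested URL.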
Upvotes: 1