Rahul

Reputation: 2100

How to set a rule using a regex in Scrapy for extracting URLs?

I want to crawl pages related to Disney on the Bloomberg website. The URLs follow this pattern:

        "http://bloomberg.com/news/2013-07-08/disney-welcometohomepageofdisney"

So I have written the rule below for it:

    rules = [
        Rule(SgmlLinkExtractor(allow=('/news/*/disney*',)), follow=True),
    ]

But the rule above doesn't work as I want: the crawled pages in the output are not related to Disney. Please help me fix this rule.

Upvotes: 1

Views: 2930

Answers (2)

Blender

Reputation: 298246

/news/* matches /news followed by any number of / characters, because in a regex * repeats the preceding character rather than acting as a wildcard.

The correct regex would be:

/news/.*/disney
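A quick way to compare the two patterns is to try them with Python's re module. This is only a sketch, assuming the link extractor applies allow patterns with search semantics (like re.search) against the full URL; the second URL and the allowed helper are made up for illustration.

```python
import re

original = r"/news/*/disney*"  # pattern from the question
fixed = r"/news/.*/disney"     # pattern suggested in this answer

disney_url = "http://bloomberg.com/news/2013-07-08/disney-welcometohomepageofdisney"
other_url = "http://bloomberg.com/news/2013-07-08/apple-earnings"  # hypothetical URL


def allowed(pattern, url):
    """Return True if the pattern matches anywhere in the URL."""
    return re.search(pattern, url) is not None


for url in (disney_url, other_url):
    # Prints: result for the original pattern, result for the fixed one, URL
    print(allowed(original, url), allowed(fixed, url), url)
```

Under these assumptions the original pattern does not match even the Disney URL, since the date segment sits between /news/ and /disney and /* can only consume slashes.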

Upvotes: 3

abc123

Reputation: 18803

You likely need the following regex:

 /news/[^/]+/disney.*

With the slashes escaped (only needed where / is the regex delimiter), it looks like:

\/news\/[^\/]+\/disney.*

This way, [^/]+ matches only up to the next / instead of matching anything.
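The difference between [^/]+ and .* shows up when there are extra path segments between /news/ and /disney. The sketch below uses Python's re module with search semantics, which is an assumption about how the link extractor applies the pattern; the nested URL is hypothetical.

```python
import re

single_segment = re.compile(r"/news/[^/]+/disney.*")
escaped = re.compile(r"\/news\/[^\/]+\/disney.*")  # same pattern, slashes escaped
any_depth = re.compile(r"/news/.*/disney")         # .* can span several segments

direct = "http://bloomberg.com/news/2013-07-08/disney-welcometohomepageofdisney"
nested = "http://bloomberg.com/news/2013-07-08/markets/disney-earnings"  # hypothetical URL

urls = (direct, nested)
checks = {
    "single_segment": [bool(single_segment.search(u)) for u in urls],
    "escaped": [bool(escaped.search(u)) for u in urls],
    "any_depth": [bool(any_depth.search(u)) for u in urls],
}
print(checks)
```

Here the escaped and unescaped forms behave identically in Python, and only the .* version accepts the nested URL, so [^/]+ is the stricter choice if exactly one segment (the date) is expected.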


Upvotes: 1
