Vy.Iv

Reputation: 879

Is it possible to take a URL and use it with a regular expression to generate requests (Scrapy)?

I wanted to ask whether there is an option in Scrapy to crawl websites using only a URL and regular expressions. When you want to extract certain information, you usually need rules (not always) to extract links and follow those links to the page where the needed information is. What I mean is: is it possible to take a URL, use it with a regular expression to generate requests, and then parse the results?

For example, let's take this URL:

http://www.example.com/date/2014/news/117

Let's say that all the articles differ only in the last part of the URL, "/117". So to my mind it would be easier to write a regular expression for the URL:

http://www.example.com/date/2014/news/\d+

If you could make HTTP requests to the pages matching this regular expression, it would make life much simpler in some cases. I wonder, is there such a way?
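
Roughly, what I imagine is generating the requests for the numeric part myself and then parsing each response. A minimal sketch of the idea (the spider name, the ID range, and the parsing below are made up, just to illustrate):

import scrapy


class NewsSpider(scrapy.Spider):
    name = 'news'  # hypothetical name

    def start_requests(self):
        # Generate requests for URLs of the form .../date/2014/news/<number>
        for article_id in range(100, 121):
            url = 'http://www.example.com/date/2014/news/%d' % article_id
            yield scrapy.Request(url, callback=self.parse_article)

    def parse_article(self, response):
        # Made-up extraction, only to show where each result would be parsed
        title = response.xpath('//h1/text()').extract()
        self.log('Parsed title: %s' % title)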

Upvotes: 0

Views: 371

Answers (1)

Guy Gavriely

Reputation: 11396

CrawlSpider with the right link extractor can do just that; see this example from the Scrapy docs:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class MySpider(CrawlSpider):
    ...
    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(SgmlLinkExtractor(allow=('category\.php', ), deny=('subsection\.php', ))),

        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(SgmlLinkExtractor(allow=('item\.php', )), callback='parse_item'),
    )

    ...
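
For the URLs in the question, the allow pattern could look roughly like this (just a sketch: it assumes the article links actually appear on the pages the spider crawls and that parse_item is defined on the spider):

Rule(SgmlLinkExtractor(allow=(r'/date/2014/news/\d+', )), callback='parse_item'),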

Upvotes: 1
