Albert Delhom

Reputation: 13

Scrapy rules for links selection

I am trying to scrape pages vertically whose URLs follow a simple rule:

They contain /MLA#### or /MLA-#### (where each # is a random digit).

The problem is that with the following code Scrapy only picks up the pages named /MLA-####. When a /MLA#### or /####MLA### page appears, my code doesn't work and the scraping is wrong.

 rules =  (Rule(LinkExtractor(allow=r'/_Desde_'), follow=True),
        Rule(LinkExtractor(allow='/'+'MLA'), follow=True, callback='parse_items'))

Previously it was as follows:

 rules =  (Rule(LinkExtractor(allow=r'/_Desde_'), follow=True),
        Rule(LinkExtractor(allow=r'/MLA'), follow=True, callback='parse_items'))

So how can I tell my code that I want to scrape all the links that contain MLA, no matter what precedes or follows it?

Thanks for your comments. Regards

Upvotes: 0

Views: 54

Answers (1)

Michael Savchenko

Reputation: 1445

In fact, '/' + 'MLA' is exactly equal to '/MLA'; that is just string concatenation, so both versions of your rule behave the same. What you need is a couple of regular expressions.

I think Rule(LinkExtractor(allow=[r'\d+MLA', r'MLA-\d+']), follow=True, callback='parse_items') will work for you. Read up on regular expressions; they are a must for scraping. In this case everything is quite simple: we have the literal MLA and \d+, which stands for one or more digits. If the /MLA#### form (no hyphen) must match too, make the hyphen optional: r'MLA-?\d+'.
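If you want to check the patterns before putting them in a spider, here is a minimal sketch using only the standard re module. It assumes the allow patterns are searched for anywhere in the URL (which is my understanding of how LinkExtractor applies them). The helper is_allowed, the example.com URLs and the MLA-?\d+ variant with an optional hyphen are made up for illustration; they are not from your code.

```python
import re

# r'\d+MLA' is the pattern suggested above; r'MLA-?\d+' is an assumed
# variant with an optional hyphen so that /MLA1234 is accepted as well.
ALLOW_PATTERNS = [r'\d+MLA', r'MLA-?\d+']


def is_allowed(url, patterns=ALLOW_PATTERNS):
    """Return True if any pattern is found anywhere in the URL."""
    return any(re.search(pattern, url) for pattern in patterns)


# Hypothetical URLs, one per shape mentioned in the question,
# plus a pagination link that should not reach parse_items.
SAMPLE_URLS = [
    "https://example.com/MLA-1234",
    "https://example.com/MLA1234",
    "https://example.com/1234MLA567",
    "https://example.com/_Desde_51",
]

for url in SAMPLE_URLS:
    print(url, is_allowed(url))
```

If all three item shapes come back as allowed and the pagination link does not, the same list can go into allow=[...] in the second Rule.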

Good luck.

Upvotes: 1
