Reputation: 13
I am trying to crawl pages whose URLs follow a simple rule:
They contain /MLA#### or /MLA-#### (each # is a digit).
The problem is that with the following code Scrapy only accepts the pages named /MLA-####. When a /MLA#### or /####MLA### page appears, my rule doesn't match it and the scraping goes wrong.
rules = (Rule(LinkExtractor(allow=r'/_Desde_'), follow=True),
Rule(LinkExtractor(allow='/'+'MLA'), follow=True, callback='parse_items'))
Previously it was as follows:
rules = (Rule(LinkExtractor(allow=r'/_Desde_'), follow=True),
Rule(LinkExtractor(allow=r'/MLA'), follow=True, callback='parse_items'))
So how can I tell my code: I want to scrape all the links that contain MLA, no matter what precedes or follows it?
Thanks for your comments. Regards
Upvotes: 0
Views: 54
Reputation: 1445
In fact, '/' + 'MLA'
is exactly equal to '/MLA'
(it's just string concatenation), so the two versions of your rule behave the same. What you need is a couple of regular expressions.
I think Rule(LinkExtractor(allow=[r'\d+MLA', r'MLA-\d+']), follow=True, callback='parse_items')
will work for you. Have a read about regular expressions; they're a must for scraping. In this case everything is quite simple: we have MLA and \d+,
which stands for one or more digits.
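A quick way to see which URL shapes a list of allow patterns accepts is to try them with Python's re module before putting them in a Rule. This is only a sketch: it assumes the patterns are applied by searching anywhere in the URL, the example.com URLs are made up for illustration, and the second list with an optional hyphen (MLA-?\d+) is a suggested variant for the plain /MLA#### form, not part of the rule above.

```python
import re

# Patterns from the rule above.
ALLOW_PATTERNS = [r'\d+MLA', r'MLA-\d+']

# Hypothetical variant: "-?" makes the hyphen optional,
# so both MLA-1234 and MLA1234 are accepted.
ALLOW_OPTIONAL_HYPHEN = [r'\d+MLA', r'MLA-?\d+']

# Made-up URLs covering the three shapes from the question,
# plus a pagination link that should not reach parse_items.
SAMPLE_URLS = [
    'https://example.com/MLA-1234',
    'https://example.com/MLA1234',
    'https://example.com/1234MLA567',
    'https://example.com/_Desde_49',
]


def is_allowed(url, patterns):
    """Return True if any pattern is found anywhere in the URL."""
    return any(re.search(pattern, url) for pattern in patterns)


if __name__ == '__main__':
    for url in SAMPLE_URLS:
        print(url,
              is_allowed(url, ALLOW_PATTERNS),
              is_allowed(url, ALLOW_OPTIONAL_HYPHEN))
```

If the check confirms that the unhyphenated form is missed by the first list, the second list can be passed to allow= in the same way.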
Good luck.
Upvotes: 1