Reputation: 879
I wanted to ask whether there is a way in Scrapy to crawl websites using only a URL and regular expressions. When I want to extract certain information, I usually (though not always) have to use rules to extract links and follow those links to the page where the needed information is. What I mean is: is it possible to take a URL, combine it with a regular expression to generate requests, and then parse the results?
As an example, let's take this URL:
http://www.example.com/date/2014/news/117
Let's say that all the articles differ only in the last part of the URL, “/117”. To my mind it would be easier to write a regular expression for the URL:
http://www.example.com/date/2014/news/\d+
If you could use this regular expression to make HTTP requests to the matching pages, it would make life much simpler in some cases. I wonder, is there such a way?
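Something like this sketch is what I have in mind (just to illustrate the idea; the spider name and the ID range are made up, since I don't know the real ones in advance):
import scrapy

class NewsSpider(scrapy.Spider):
    # Rough sketch of the idea: build the requests straight from the URL
    # pattern instead of extracting links from crawled pages.
    name = "news_sketch"

    def start_requests(self):
        # Assumes the article IDs are sequential integers; the range is made up.
        for article_id in range(1, 200):
            url = "http://www.example.com/date/2014/news/%d" % article_id
            yield scrapy.Request(url, callback=self.parse_article)

    def parse_article(self, response):
        # The actual extraction would happen here.
        self.logger.info("Visited %s", response.url)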
Upvotes: 0
Views: 371
Reputation: 11396
CrawlSpider with the right link extractor can do just that; see this example from the Scrapy docs:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class MySpider(CrawlSpider):
    ...

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(SgmlLinkExtractor(allow=(r'category\.php', ), deny=(r'subsection\.php', ))),

        # Extract links matching 'item.php' and parse them with the spider's method parse_item.
        Rule(SgmlLinkExtractor(allow=(r'item\.php', )), callback='parse_item'),
    )

    ...
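Applied to the URL pattern from your question, a rule along these lines should do it (just a sketch, not tested; the spider name, allowed domain and start URL are placeholders, and in Scrapy 1.0+ you would import LinkExtractor from scrapy.linkextractors instead of SgmlLinkExtractor):
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class NewsSpider(CrawlSpider):
    # The name, domain and start URL are placeholders; only the regex in
    # `allow` comes from the URL pattern in the question.
    name = 'example_news'
    allowed_domains = ['www.example.com']
    start_urls = ['http://www.example.com/date/2014/']

    rules = (
        # Follow every link whose URL matches the article pattern and
        # pass the response to parse_item.
        Rule(SgmlLinkExtractor(allow=(r'/date/2014/news/\d+$', )),
             callback='parse_item'),
    )

    def parse_item(self, response):
        # Extraction of the article data would go here.
        self.log("Article page: %s" % response.url)
Note that a CrawlSpider still discovers those URLs by following links on the pages it crawls; the regex only filters which of the extracted links get requested.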
Upvotes: 1