Reputation: 401
I am trying to crawl Erowid and gather data about experiences. I am trying to get from the general information about a drug to the actual experience itself.
However, the LinkExtractor doesn't seem to be matching any of the links.
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import HtmlXPathSelector
from Erowid.items import ErowidItem
class ExperiencesSpider(CrawlSpider):
    name = "test"
    allowed_domains = ["www.erowid.org"]
    start_urls = ['https://www.erowid.org/experiences/subs/exp_aPVP.shtml']
    rules = [
        Rule(LinkExtractor(allow=('/experiences/exp.php?ID=[0-9]+')), callback='parse_item', follow=True)
    ]

    def parse_item(self, response):
        [other code]
From https://www.erowid.org/experiences/subs/exp_aPVP.shtml, I am trying to reach the experiences which have an href of
/experiences/exp.php?ID= (some digits)
I can't find the right pattern for the part after ID, and I have already tried a variety of different regexes, including
\d+ and [0-9]+
Is the error caused by an incorrect regular expression? If so, what would the correct expression be? If not, why is this happening and how can I fix it?
Upvotes: 2
Views: 1364
Reputation: 473803
Here is the expression that works for me:
/experiences/exp\.php\?ID=\d+$
And here is how the rules look:
rules = [
    Rule(LinkExtractor(allow=r'/experiences/exp\.php\?ID=\d+$'),
         callback='parse_item', follow=True)
]
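As to why the original pattern fails: the digit part (\d+ or [0-9]+) was never the problem. In a regex, an unescaped ? makes the preceding character optional, so php?ID= matches "phID=" or "phpID=" but never a literal "?". The unescaped . is also a wildcard rather than a literal dot. A quick check with Python's re module, outside Scrapy, illustrates this. It is only a sketch: the ID value in the sample URL is made up, and it assumes the extractor applies the pattern as a search over the absolute URL.

```python
import re

# Pattern from the question: '.' and '?' are unescaped metacharacters.
unescaped = re.compile(r'/experiences/exp.php?ID=[0-9]+')
# Pattern from the answer: both are escaped, so they match literally.
escaped = re.compile(r'/experiences/exp\.php\?ID=\d+$')

# Hypothetical experience link (the ID is invented for illustration).
url = 'https://www.erowid.org/experiences/exp.php?ID=12345'

# 'p?' means "an optional p", so the literal '?' in the URL is never matched.
unescaped_hit = unescaped.search(url) is not None
escaped_hit = escaped.search(url) is not None

# The unescaped pattern does match strings that are not real links,
# because '.' accepts any character and the second 'p' is optional.
bogus_hit = unescaped.search('/experiences/expXphID=1') is not None

print(unescaped_hit, escaped_hit, bogus_hit)
```

The trailing $ in the working expression additionally keeps the rule from matching URLs that carry extra characters after the digits.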
Upvotes: 2