Scrapy: LinkExtractor not working

Question

I am trying to crawl Erowid and gather data about experiences. I am trying to get from the general information about a drug to the actual experience itself.

However the LinkExtractor doesn't seem to be working.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import HtmlXPathSelector

from Erowid.items import ErowidItem


class ExperiencesSpider(CrawlSpider):
    name = "test"
    allowed_domains = ["www.erowid.org"]
    start_urls = ['https://www.erowid.org/experiences/subs/exp_aPVP.shtml']
    rules = [
        Rule(LinkExtractor(allow=('/experiences/exp.php?ID=[0-9]+')),
             callback='parse_item', follow=True)
    ]

    def parse_item(self, response):
        [other code]

From https://www.erowid.org/experiences/subs/exp_aPVP.shtml, I am trying to reach the experiences which have an href of

/experiences/exp.php?ID=  (some digits)

I can't find the right pattern for the part after ID, and I have already tried a variety of different regexes, including

\d+ and [0-9]+

Is the error caused by an incorrect regular expression? If yes, what would the correct regular expression be? If no, why is this error occurring and how can I fix it?

alecxe · Accepted Answer

Here is the expression that works for me:

/experiences/exp\.php\?ID=\d+$

And here is how the rules look:

rules = [
    Rule(LinkExtractor(allow=r'/experiences/exp\.php\?ID=\d+$'),
         callback='parse_item', follow=True)
]
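
To see why the escaping matters, here is a minimal sketch using only Python's re module. It assumes that LinkExtractor applies its allow patterns with search semantics against the full URL, which is my understanding of its behavior. The sample URL and its ID value are made up for illustration. In the original pattern the unescaped `?` makes the preceding `p` optional instead of matching a literal question mark, and the unescaped `.` matches any character, so the pattern never matches the real links.

```python
import re

# Hypothetical sample URL in the shape described in the question;
# the ID value is invented for illustration.
url = "https://www.erowid.org/experiences/exp.php?ID=12345"

# Pattern from the question: "." and "?" are left as regex metacharacters.
broken = re.compile(r"/experiences/exp.php?ID=[0-9]+")

# Pattern from the answer: "." and "?" are escaped, "$" anchors the end.
fixed = re.compile(r"/experiences/exp\.php\?ID=\d+$")

broken_match = broken.search(url)
fixed_match = fixed.search(url)

# The broken pattern reads "p?" as "an optional p", so it expects "ID="
# right after "ph" or "php" and never consumes the literal "?".
print(broken_match is None)

# It does match a string without the question mark, which is not what we want.
print(broken.search("/experiences/exp.phpID=1") is not None)

# The escaped pattern matches the path and query of the real link shape.
print(fixed_match.group(0))
```

The trailing `$` in the working pattern also rejects URLs that carry anything after the digits, so drop it if the links you want can have extra query parameters.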
