David Yi

Reputation: 401

Scrapy: LinkExtractor not working

I am trying to crawl Erowid and gather data about experiences. I am trying to get from the general information about a drug to the actual experience itself.

However, the LinkExtractor doesn't seem to be working: no experience links are followed.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import HtmlXPathSelector

from Erowid.items import ErowidItem


class ExperiencesSpider(CrawlSpider):
    name = "test"
    allowed_domains = ["www.erowid.org"]
    start_urls = ['https://www.erowid.org/experiences/subs/exp_aPVP.shtml']
    rules = [ 
        Rule(LinkExtractor(allow =('/experiences/exp.php?ID=[0-9]+')),     callback = 'parse_item', follow = True)

    ]
    def parse_item(self, response):
        [other code]

From https://www.erowid.org/experiences/subs/exp_aPVP.shtml, I am trying to reach the experiences which have an href of

/experiences/exp.php?ID=  (some digits)

I can't find the proper pattern for the part after ID=, and I have already tried a variety of regexes, including

\d+ and [0-9]+

Is the error caused by an incorrect regular expression? If so, what would the correct one be? If not, why is this happening and how can I fix it?

Upvotes: 2

Views: 1364

Answers (1)

alecxe

Reputation: 473803

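Yes, the regex is the problem. In your pattern the . and ? are unescaped, so ? makes the preceding "p" optional instead of matching a literal question mark, and the pattern never matches these URLs. Escape both characters. Here is the expression that works for me: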

/experiences/exp\.php\?ID=\d+$

And here is how the rules look:

rules = [
    Rule(LinkExtractor(allow=r'/experiences/exp\.php\?ID=\d+$'),
         callback='parse_item', follow=True)
]
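
If you want to check this outside Scrapy, here is a minimal sketch using only Python's re module. It assumes LinkExtractor applies the allow pattern as a regex search against the link URL; the sample href (including its ID value) and the helper name matches are made up for illustration.

```python
import re

# Hypothetical href of the shape described in the question; the ID is invented.
href = "/experiences/exp.php?ID=12345"

# Pattern from the question: "." matches any character and "?" makes the
# preceding "p" optional, so no literal "?" is ever matched.
unescaped = r"/experiences/exp.php?ID=[0-9]+"

# Pattern from this answer: "." and "?" are escaped to match literally.
escaped = r"/experiences/exp\.php\?ID=\d+$"


def matches(pattern, url):
    """Return True if the pattern is found anywhere in the url."""
    return re.search(pattern, url) is not None


print(matches(unescaped, href))  # False
print(matches(escaped, href))    # True
```

The unescaped pattern fails because, after "exp.ph", it expects an optional "p" followed directly by "ID=", while the real URL has "?" in between.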

Upvotes: 2
