Scrapy LinkExtractor fails to find existing url

Question

I have a crawler like this:

class SkySpider(CrawlSpider):
    name = "spider_v1"
    allowed_domains = [
        "atsu.edu",
    ]

    start_urls = [
        "http://www.atsu.edu",
    ]

    rules = (
        Rule(
            INFO_LINKS_EXTRACTOR,
            follow=True,
            callback='parse_item',
        ),
    )

    def parse_item(self, response):
        print("ENTERED!")
        item = SportsScraperItem()
        item["contact"] = self._parse_contact(response)
        return item

In my helpers.py I have:

from scrapy.linkextractors import LinkExtractor


def _r(string):
    return f"(.*?)(\b{string}\b)(.*)"


INFO_LINKS_EXTRACTOR = LinkExtractor(
    allow=(
        _r('about'),
    ),
    unique=True,
)

I know that atsu.edu has a link to https://www.atsu.edu/about-atsu/, but my extractor does not seem to see it, and the parse_item() method is never run. What am I doing wrong here?
EDIT 1: Logs:

2019-10-01 15:40:58 [scrapy.core.engine] INFO: Spider opened
2019-10-01 15:40:58 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-10-01 15:40:58 [steppersspider_v1] INFO: Spider opened: steppersspider_v1
2019-10-01 15:40:58 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-10-01 15:40:59 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to  from 
2019-10-01 15:41:05 [scrapy.core.engine] DEBUG: Crawled (200)  (referer: None)
2019-10-01 15:41:11 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to  from 
2019-10-01 15:41:15 [scrapy.core.engine] DEBUG: Crawled (200)  (referer: None)
2019-10-01 15:41:19 [scrapy.core.engine] DEBUG: Crawled (200)  (referer: None)
2019-10-01 15:41:19 [steppersspider_v1] DEBUG: Saved file steppers-www.atsu.edu.html
2019-10-01 15:41:20 [scrapy.core.engine] INFO: Closing spider (finished)
2019-10-01 15:41:20 [scrapy.statscollectors] INFO: Dumping Scrapy stats:

EDIT 2
Here is how I tested this regexp on regex101.com.

EDIT 3
Working function for regexp:

def _r(string):
    return r"^(.*?)(\b{string}\b)(.*)$".format(string=string)
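A minimal sketch of why the EDIT 3 version behaves differently from the original _r helper: in a non-raw string, Python turns \b into a backspace character before the regex engine sees it, so the pattern can only match a URL that literally contains backspaces. The names broken_pattern and working_pattern are illustrative and not part of the original code.

```python
import re


def broken_pattern(word):
    # Non-raw f-string: "\b" here is the backspace character (0x08),
    # not the regex word-boundary escape.
    return f"(.*?)(\b{word}\b)(.*)"


def working_pattern(word):
    # Raw string: the backslash survives, so the regex engine sees \b
    # and treats it as a word boundary.
    return r"^(.*?)(\b{string}\b)(.*)$".format(string=word)


url = "https://www.atsu.edu/about-atsu/"

broken_match = re.search(broken_pattern("about"), url)
working_match = re.search(working_pattern("about"), url)

print("broken pattern: ", repr(broken_pattern("about")))
print("broken matches: ", broken_match is not None)
print("working matches:", working_match is not None)
```

An online regex tester receives the pattern text directly, so it never goes through Python's string-escape step; that may be why the expression appeared to work there.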

Gallaecio · Accepted Answer

By default, link extractors only search for a and area tags. The links you are looking for seem to be in li tags.

You need to pass the tags parameter to the constructor of your link extractor with the desired tags. For example:

tags=('a', 'area', 'li')

See https://doc.scrapy.org/en/latest/topics/link-extractors.html#module-scrapy.linkextractors.lxmlhtml
