Chiefir

Reputation: 2671

Scrapy LinkExtractor fails to find existing url

I have a Crawler like this:

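from scrapy.spiders import CrawlSpider, Rule

# Assumes helpers.py (shown below) is importable from the spider module.
from helpers import INFO_LINKS_EXTRACTOR

# SportsScraperItem and _parse_contact are defined elsewhere in the project (not shown).


class SkySpider(CrawlSpider):
    name = "spider_v1"
    allowed_domains = [
        "atsu.edu",
    ]

    start_urls = [
        "http://www.atsu.edu",
    ]

    rules = (
        Rule(
            INFO_LINKS_EXTRACTOR,
            follow=True,
            callback='parse_item',
        ),
    )

    def parse_item(self, response):
        print("ENTERED!")
        item = SportsScraperItem()
        item["contact"] = self._parse_contact(response)
        return item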

In my helpers.py I have:

from scrapy.linkextractors import LinkExtractor


def _r(string):
    return f"(.*?)(\b{string}\b)(.*)"


INFO_LINKS_EXTRACTOR = LinkExtractor(
    allow=(
        _r('about'),
    ),
    unique=True,
)

I know that atsu.edu has a link to https://www.atsu.edu/about-atsu/, but my extractor does not seem to see it, and the parse_item() method is never run. What am I doing wrong here?

EDIT 1: Logs:

2019-10-01 15:40:58 [scrapy.core.engine] INFO: Spider opened
2019-10-01 15:40:58 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-10-01 15:40:58 [steppersspider_v1] INFO: Spider opened: steppersspider_v1
2019-10-01 15:40:58 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-10-01 15:40:59 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.atsu.edu/robots.txt> from <GET http://WWW.ATSU.EDU/robots.txt>
2019-10-01 15:41:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.atsu.edu/robots.txt> (referer: None)
2019-10-01 15:41:11 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.atsu.edu/> from <GET http://WWW.ATSU.EDU>
2019-10-01 15:41:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.atsu.edu/robots.txt> (referer: None)
2019-10-01 15:41:19 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.atsu.edu/> (referer: None)
2019-10-01 15:41:19 [steppersspider_v1] DEBUG: Saved file steppers-www.atsu.edu.html
2019-10-01 15:41:20 [scrapy.core.engine] INFO: Closing spider (finished)
2019-10-01 15:41:20 [scrapy.statscollectors] INFO: Dumping Scrapy stats:

EDIT 2
I tested this regexp on regex101.com.

EDIT 3
Working function for regexp:

def _r(string):
    return r"^(.*?)(\b{string}\b)(.*)$".format(string=string)

Upvotes: 1

Views: 77

Answers (1)

Gallaecio

Reputation: 3847

By default, link extractors only search for a and area tags. The links you are looking for seem to be in li tags.

You need to pass the tags parameter to the constructor of your link extractor with the desired tags. For example:

tags=('a', 'area', 'li')

See https://doc.scrapy.org/en/latest/topics/link-extractors.html#module-scrapy.linkextractors.lxmlhtml

Upvotes: 2
