Reputation: 2671
I have a Crawler like this:
class SkySpider(CrawlSpider):
    name = "spider_v1"
    allowed_domains = [
        "atsu.edu",
    ]
    start_urls = [
        "http://www.atsu.edu",
    ]
    rules = (
        Rule(
            INFO_LINKS_EXTRACTOR,
            follow=True,
            callback='parse_item',
        ),
    )

    def parse_item(self, response):
        print("ENTERED!")
        item = SportsScraperItem()
        item["contact"] = self._parse_contact(response)
        return item
In my helpers.py I have:
from scrapy.linkextractors import LinkExtractor

def _r(string):
    return f"(.*?)(\b{string}\b)(.*)"

INFO_LINKS_EXTRACTOR = LinkExtractor(
    allow=(
        _r('about'),
    ),
    unique=True,
)
I know that atsu.edu has a link https://www.atsu.edu/about-atsu/, but my extractor does not seem to see it, and the parse_item() method is never run. What am I doing wrong here?
EDIT 1:
Logs:
2019-10-01 15:40:58 [scrapy.core.engine] INFO: Spider opened
2019-10-01 15:40:58 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-10-01 15:40:58 [steppersspider_v1] INFO: Spider opened: steppersspider_v1
2019-10-01 15:40:58 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-10-01 15:40:59 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.atsu.edu/robots.txt> from <GET http://WWW.ATSU.EDU/robots.txt>
2019-10-01 15:41:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.atsu.edu/robots.txt> (referer: None)
2019-10-01 15:41:11 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.atsu.edu/> from <GET http://WWW.ATSU.EDU>
2019-10-01 15:41:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.atsu.edu/robots.txt> (referer: None)
2019-10-01 15:41:19 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.atsu.edu/> (referer: None)
2019-10-01 15:41:19 [steppersspider_v1] DEBUG: Saved file steppers-www.atsu.edu.html
2019-10-01 15:41:20 [scrapy.core.engine] INFO: Closing spider (finished)
2019-10-01 15:41:20 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
EDIT 2:
Here is how I tested this regexp on regex101.com.
EDIT 3:
Working function for regexp:
def _r(string):
    return r"^(.*?)(\b{string}\b)(.*)$".format(string=string)
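For anyone comparing the two versions, here is a minimal sketch of why the first helper could never match. In a non-raw string, Python turns "\b" into a backspace character before the regex engine sees it, so the pattern looks for a literal 0x08 byte instead of a word boundary. The names broken and fixed are illustrative only, and the re.escape call is an extra safeguard that the original helper does not have.

```python
import re

def broken(word):
    # Non-raw f-string: "\b" becomes the backspace character (0x08),
    # so the regex engine never sees a word-boundary assertion.
    return f"(.*?)(\b{word}\b)(.*)"

def fixed(word):
    # Raw f-string keeps the backslash, so "\b" reaches the regex engine.
    # re.escape is an added precaution for words with regex metacharacters.
    return rf"^(.*?)(\b{re.escape(word)}\b)(.*)$"

url = "https://www.atsu.edu/about-atsu/"

print("\x08" in broken("about"))         # the pattern contains a backspace
print(re.search(broken("about"), url))   # no match on a normal URL
print(re.search(fixed("about"), url))    # matches the "about" segment
```

A raw f-string (rf"...") does the same job as the r"...".format(...) version; which one to use is a matter of taste.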
Upvotes: 1
Views: 77
Reputation: 3847
By default, link extractors only search <a> and <area> tags. The links you are looking for seem to be in <li> tags.
You need to pass the tags parameter to the constructor of your link extractor with the desired tags. For example:
tags=('a', 'area', 'li')
See https://doc.scrapy.org/en/latest/topics/link-extractors.html#module-scrapy.linkextractors.lxmlhtml
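Before changing tags, it may help to check which tag actually holds the href on the page. The following is a rough diagnostic using only the standard library, not Scrapy itself. The HrefTagCounter class and the sample markup are hypothetical; in practice you would feed it the HTML your spider saved.

```python
from collections import Counter
from html.parser import HTMLParser

class HrefTagCounter(HTMLParser):
    """Counts, per tag name, the href attributes whose value contains a keyword."""

    def __init__(self, keyword):
        super().__init__()
        self.keyword = keyword
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        for name, value in attrs:
            if name == "href" and value and self.keyword in value:
                self.counts[tag] += 1

# Hypothetical markup standing in for the saved page.
sample_html = """
<ul>
  <li><a href="/about-atsu/">About ATSU</a></li>
  <li><a href="/contact/">Contact</a></li>
</ul>
"""

counter = HrefTagCounter("about")
counter.feed(sample_html)
print(dict(counter.counts))
```

The tag names this reports are the ones the tags parameter selects on, so they are the values to pass to the link extractor.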
Upvotes: 2