Reputation: 21
I'm trying to obtain every single link (and no other data) from a website using Scrapy. I want to do this by starting at the homepage, scraping all the links from there, then following each link found and scraping all (unique) links from those pages, repeating until there are no more links to follow.
I also have to enter a username and password to access each page on the site, so I've included a basic authentication header in my start_requests.
So far I have a spider which gives me the links on the homepage only; however, I can't figure out why it isn't following those links and scraping other pages.
Here is my spider:
from examplesite.items import ExamplesiteItem
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule, CrawlSpider
from w3lib.http import basic_auth_header

class ExampleSpider(CrawlSpider):
    # name of crawler
    name = "examplesite"

    # only scrape pages within the example.co.uk domain
    allowed_domains = ["example.co.uk"]

    # start scraping on the site homepage once credentials have been authenticated
    def start_requests(self):
        url = "https://example.co.uk"
        username = "*********"
        password = "*********"
        auth = basic_auth_header(username, password)
        yield scrapy.Request(url=url, headers={'Authorization': auth})

    # rules for recursively scraping the URLs found
    rules = [
        Rule(
            LinkExtractor(
                canonicalize=True,
                unique=True
            ),
            follow=True,
            callback="parse"
        )
    ]

    # method to identify hyperlinks by XPath and extract them as Scrapy items
    def parse(self, response):
        for element in response.xpath('//a'):
            item = ExamplesiteItem()
            oglink = element.xpath('@href').extract_first()
            # need to prepend the domain, as some hrefs are relative rather than full https URLs
            if "http" not in oglink:
                item['link'] = "https://example.co.uk" + oglink
            else:
                item['link'] = oglink
            yield item
Here is my items class:
from scrapy import Field, Item

class ExamplesiteItem(Item):
    link = Field()
I think where I'm going wrong is the "Rules", which I know are needed to follow the links, but I don't fully understand how they work (I've tried reading several explanations online but am still not sure).
Any help would be much appreciated!
Upvotes: 2
Views: 3292
Reputation: 28216
Your rules are fine; the problem is that you are overriding the parse method.
From the Scrapy docs at https://doc.scrapy.org/en/latest/topics/spiders.html#crawling-rules:

When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.
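Renaming the callback (to, say, parse_item) and pointing the rule at it leaves CrawlSpider's own parse method free to drive the link-following. A minimal sketch of the fix, reusing the ExamplesiteItem and basic-auth setup from your question:

from examplesite.items import ExamplesiteItem
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule, CrawlSpider
from w3lib.http import basic_auth_header

class ExampleSpider(CrawlSpider):
    name = "examplesite"
    allowed_domains = ["example.co.uk"]

    rules = [
        Rule(
            LinkExtractor(canonicalize=True, unique=True),
            follow=True,
            # any name other than "parse" keeps CrawlSpider's built-in
            # parse method (which applies the rules) intact
            callback="parse_item"
        )
    ]

    def start_requests(self):
        auth = basic_auth_header("*********", "*********")
        yield scrapy.Request(url="https://example.co.uk",
                             headers={'Authorization': auth})

    def parse_item(self, response):
        for href in response.xpath('//a/@href').extract():
            item = ExamplesiteItem()
            # prepend the domain to relative hrefs so every link is a full URL
            if "http" not in href:
                href = "https://example.co.uk" + href
            item['link'] = href
            yield item

One caveat: the requests generated by the rules will not automatically re-send your Authorization header. Scrapy's built-in HttpAuthMiddleware (set http_user and http_pass attributes on the spider) is one way to apply basic auth to every request instead.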
Upvotes: 2