Allie

Reputation: 21

How to recursively scrape every link from a site using Scrapy?

I'm trying to obtain every single link (and no other data) from a website using Scrapy. I want to do this by starting at the homepage, scraping all the links there, then following each link found and scraping all (unique) links from that page, repeating until there are no more links left to follow.

I also have to enter a username and password to get into each page on the site, so I've included a basic authentication component in my start_requests.

So far I have a spider which gives me the links on the homepage only; however, I can't figure out why it isn't following those links and scraping the other pages.

Here is my spider:

    from examplesite.items import ExamplesiteItem
    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import Rule, CrawlSpider
    from scrapy import Request
    from w3lib.http import basic_auth_header
    from scrapy.crawler import CrawlerProcess

    class ExampleSpider(CrawlSpider):
        # name of crawler
        name = "examplesite"

        # only scrape pages within the example.co.uk domain
        allowed_domains = ["example.co.uk"]

        # start scraping on the site homepage once credentials have been authenticated
        def start_requests(self):
            url = "https://example.co.uk"
            username = "*********"
            password = "*********"
            auth = basic_auth_header(username, password)
            yield scrapy.Request(url=url, headers={'Authorization': auth})

        # rules for recursively scraping the URLs found
        rules = [
            Rule(
                LinkExtractor(
                    canonicalize=True,
                    unique=True
                ),
                follow=True,
                callback="parse"
            )
        ]

        # method to identify hyperlinks by xpath and extract them as scrapy items
        def parse(self, response):
            for element in response.xpath('//a'):
                item = ExamplesiteItem()
                oglink = element.xpath('@href').extract_first()
                # need to add the prefix as some hrefs are not full https URLs
                # and thus cannot be followed for scraping
                if not oglink.startswith("http"):
                    item['link'] = "https://example.co.uk" + oglink
                else:
                    item['link'] = oglink

                yield item

Here is my items class:

    from scrapy import Field, Item

    class ExamplesiteItem(Item):
        link = Field()

I think where I'm going wrong is the "Rules", which I'm aware are needed to follow the links, but I don't fully understand how they work (I've tried reading several explanations online but am still not sure).

Any help would be much appreciated!

Upvotes: 2

Views: 3292

Answers (1)

stranac

Reputation: 28216

Your rules are fine; the problem is that you're overriding the parse method.

From the scrapy docs on crawling rules (https://doc.scrapy.org/en/latest/topics/spiders.html#crawling-rules):

When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.
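
In other words, renaming the callback is all that's needed to let the rules follow links. Here's a minimal sketch of the corrected spider, assuming the method is renamed to parse_item (the name is illustrative; anything other than parse will do) and the rule's callback is updated to match:

    from examplesite.items import ExamplesiteItem
    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import Rule, CrawlSpider
    from w3lib.http import basic_auth_header

    class ExampleSpider(CrawlSpider):
        name = "examplesite"
        allowed_domains = ["example.co.uk"]

        # the callback now points at parse_item, leaving CrawlSpider's own
        # parse method free to apply the rules and follow extracted links
        rules = [
            Rule(
                LinkExtractor(canonicalize=True, unique=True),
                follow=True,
                callback="parse_item"
            )
        ]

        def start_requests(self):
            auth = basic_auth_header("*********", "*********")
            yield scrapy.Request(url="https://example.co.uk",
                                 headers={'Authorization': auth})

        # renamed from parse so the built-in link-following logic still runs
        def parse_item(self, response):
            for element in response.xpath('//a'):
                item = ExamplesiteItem()
                oglink = element.xpath('@href').extract_first()
                if not oglink.startswith("http"):
                    item['link'] = "https://example.co.uk" + oglink
                else:
                    item['link'] = oglink
                yield item

Two side notes: response.urljoin(oglink) is a more robust way to build absolute URLs than string concatenation, and the Authorization header set in start_requests only applies to that first request; if every page on the site requires credentials, Scrapy's http_user and http_pass spider attributes (handled by HttpAuthMiddleware) may be a better fit.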

Upvotes: 2
