Scrapy crawler to return only URL and Referrer when crawling

Question

I'm VERY new to Scrapy; I just found it yesterday and have only basic Python experience.

I have a group of subdomains (around 200) that I need to map, including every internal and external link.

I think I'm just not understanding the output side of things.

This is what I have so far.

import scrapy

class LinkSpider(scrapy.Spider):
    name = 'links'

    allowed_domains = ['example.com']
    start_urls = ['https://www.example.com/']

    def parse(self, response):
        # follow all links
        for href in response.css('a::attr(href)'):
            yield response.follow(href, self.parse)

It outputs to the terminal like so:

DEBUG: Crawled (200)  (referer: None)
DEBUG: Crawled (200)  (referer: http://www.example.com/)
DEBUG: Crawled (200)  (referer: http://www.example.com/)

What I'm after is a CSV or TSV like this:

URL                                         Referer
http://www.example.com/                     None
http://www.example.com/aaa/A-content-page   http://www.example.com/
http://aaa.example.com/bbb/something/       http://www.example.com/
http://aaa.example.com/bbb/another/         http://aaa.example.com/bbb/something/

Any assistance is appreciated, but I would prefer a referral to the docs over a straight solution.

This is the solution I came up with.

    def parse(self, response):
        filename = "output.tsv"
        url = response.url
        links = response.css('a::attr(href)').getall()
        # The start URLs carry no Referer header, so guard against None before decoding
        referer = response.request.headers.get('referer')
        referer = referer.decode('utf-8') if referer else 'None'
        # Append rather than reopen with 'w', which would truncate the file on every page
        with open(filename, 'a') as f:
            if f.tell() == 0:
                # empty file: write the header row once
                f.write("URL\tLink\tReferer\n")
            for item in links:
                f.write("{0}\t{1}\t{2}\n".format(url, item, referer))
        # follow all links
        for href in links:
            yield response.follow(href, self.parse)
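
As an aside, the tab-separated rows written by hand above could also be produced with Python's standard csv module, which takes care of delimiters and quoting. This is only a minimal sketch: the helper name write_link_rows and the sample rows are made up for illustration, and it writes to an in-memory buffer rather than output.tsv.

```python
import csv
import io


def write_link_rows(handle, rows):
    """Write (url, link, referer) tuples as tab-separated lines with a header."""
    writer = csv.writer(handle, delimiter='\t', lineterminator='\n')
    writer.writerow(['URL', 'Link', 'Referer'])
    for url, link, referer in rows:
        # A missing referer is written as the literal text "None"
        writer.writerow([url, link, referer if referer is not None else 'None'])


# Hypothetical sample rows, shaped like the output asked for in the question
buffer = io.StringIO()
write_link_rows(buffer, [
    ('http://www.example.com/', '/aaa/A-content-page', None),
    ('http://www.example.com/aaa/A-content-page',
     'http://aaa.example.com/bbb/something/',
     'http://www.example.com/'),
])
print(buffer.getvalue())
```

Passing an open file object instead of the StringIO buffer would write the same rows to disk.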

Ikram Khan Niazi · Accepted Answer

You can get both URLs directly in parse.

referer = response.request.headers.get('Referer')
original_url = response.url

yield {'referer': referer, 'url': original_url}

You can write the output to a file using

scrapy crawl spider_name -o bettybarclay.json
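
To make the idea above a bit more concrete, here is a rough sketch of building the yielded item. Scrapy keeps header values as bytes, and the start URLs have no Referer at all, so the value is decoded only when present. The helper link_item and the fake_response stand-in are hypothetical names used so the snippet runs without a crawl; they are not part of Scrapy.

```python
from types import SimpleNamespace


def link_item(response):
    """Build one {'url', 'referer'} row from a Scrapy-style response object."""
    raw = response.request.headers.get('Referer')
    # Header values arrive as bytes; a start URL has no Referer, so raw is None
    referer = raw.decode('utf-8') if raw is not None else None
    return {'url': response.url, 'referer': referer}


def fake_response(url, referer=None):
    """Stand-in exposing only .url and .request.headers (an assumption for this demo)."""
    headers = {} if referer is None else {'Referer': referer}
    return SimpleNamespace(url=url, request=SimpleNamespace(headers=headers))


start = fake_response('https://www.example.com/')
child = fake_response('https://www.example.com/aaa/', b'https://www.example.com/')
print(link_item(start))
print(link_item(child))
```

Inside the spider, parse could yield link_item(response) before following the links on the page. Since the question asks for CSV, note that the feed export picks its format from the file extension, so a command such as scrapy crawl links -o output.csv should produce a CSV with url and referer columns.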
