elton

Reputation: 65

How to get the proxy used for each request in an item with Scrapy?

I'm using DOWNLOADER_MIDDLEWARES to rotate proxies with a scrapy.Spider, and I would like to record the proxy used for each request in an item field, e.g. item['proxy_used'].

I guess it might be possible to get the proxy via the Stats Collector, but I'm new to Python and Scrapy, and so far I haven't been able to find a solution.

import scrapy
from tutorial.items import QuotesItem

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = [
        'http://quotes.toscrape.com/',
    ]

    def parse(self, response):
        for sel in response.css('div.quote'):
            item = QuotesItem()
            item['text'] = sel.css('span.text::text').get()
            item['author'] = sel.css('small.author::text').get()
            item['tags'] = sel.css('div.tags a.tag::text').getall()
            item['quotelink'] = sel.css('small.author ~ a[href*="goodreads.com"]::attr(href)').get()

            item['proxy_used'] = ???  # <-- proxy used by the request - how?
            yield item

        # follow pagination links @shortcut
        for a in response.css('li.next a'):
            yield response.follow(a, callback=self.parse)

Upvotes: 2

Views: 461

Answers (1)

Colwin

Reputation: 2685

You can use the response object to access the proxy that was used, like below:

response.meta.get("proxy")

I've updated your code accordingly:

import scrapy
from tutorial.items import QuotesItem

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = [
        'http://quotes.toscrape.com/',
    ]

    def parse(self, response):
        for sel in response.css('div.quote'):
            item = QuotesItem()
            item['text'] = sel.css('span.text::text').get()
            item['author'] = sel.css('small.author::text').get()
            item['tags'] = sel.css('div.tags a.tag::text').getall()
            item['quotelink'] = sel.css('small.author ~ a[href*="goodreads.com"]::attr(href)').get()

            item['proxy_used'] = response.meta.get("proxy")
            yield item

        # follow pagination links @shortcut
        for a in response.css('li.next a'):
            yield response.follow(a, callback=self.parse)

Upvotes: 1
