
Reputation: 129

How can I implement a custom proxy in Scrapy?

I'm trying to implement ScraperAPI as a custom proxy, but as a beginner I think I'm doing something wrong, even though I followed their documentation to set everything up. Here is a documentation link

from scrapy import Spider
from scrapy.http import Request
from .config import API
from scraper_api import ScraperAPIClient
client = ScraperAPIClient(API)



class GlassSpider(Spider):
    name = 'glass'
    allowed_domains = ['glassdoor.co.uk']
    start_urls = [client.scrapyGet(url='https://www.glassdoor.co.uk/Job/russian-jobs-SRCH_KE0,7.htm?fromAge=1')]
   

    def parse(self, response):
        jobs = response.xpath('//*[contains(@class, "react-job-listing")]')
        for job in jobs:
            job_url = job.xpath('.//*[contains(@class, "jobInfoItem jobTitle")]/@href').extract_first()
            absulate_job_url = response.urljoin(job_url)

            yield Request(client.scrapyGet(url=absulate_job_url),
                           callback=self.parse_jobpage,
                           meta={
                               "Job URL": absulate_job_url
                        })

    def parse_jobpage(self, response): 
        absulate_job_url = response.meta.get('Job URL')
        job_description = "".join(line for line in response.xpath('//*[contains(@class, "desc")]//text()').extract())

        yield {
            "Job URL": absulate_job_url,   
            "Job Description": job_description
        }

That's the output I'm receiving. Please, what's wrong with my code? Please fix it for me so I can follow along and get the point. Thank you.

2020-10-01 23:01:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://api.scraperapi.com/?url=https%3A%2F%2Fwww.glassdoor.co.uk%2FJob%2Frussian-jobs-SRCH_KE0%2C7.htm%3FfromAge%3D1&api_key=bec9dd9f2be095dfc6158a7e609&scraper_sdk=python> (referer: None)
2020-10-01 23:01:45 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'api.scraperapi.com': <GET https://api.scraperapi.com/?url=https%3A%2F%2Fapi.scraperapi.com%2Fpartner%2FjobListing.htm%3Fpos%3D101%26ao%3D1044074%26s%3D149%26guid%3D00000174e51ccd8988e2e5420e67cf0d%26src%3DGD_JOB_AD%26t%3DSRFJ%26vt%3Dw%26cs%3D1_94f59ee8%26cb%3D1601571704401%26jobListingId%3D3696480795&api_key=bec9d9f82b0955c615c8a7e639scraper_sdk=python>

Upvotes: 0

Views: 166

Answers (1)

renatodvc

Reputation: 2564

I'm not familiar with this particular lib, but from your execution logs the issue is that your request is being filtered, since it's considered offsite.

[scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'api.scraperapi.com': <GET https://api.scraperapi.com/?url=https%3A%2F%2Fapi.scraperapi.com%2Fpartner%2FjobListing.htm%3Fpos%3D101%26ao%3D1044074%26s%3D149%26guid%3D00000174e51ccd8988e2e5420e67cf0d%26src%3DGD_JOB_AD%26t%3DSRFJ%26vt%3Dw%26cs%3D1_94f59ee8%26cb%3D1601571704401%26jobListingId%3D3696480795&api_key=bec9d9f82b0955c615c8a7e639scraper_sdk=python>

Since ScraperAPI makes your requests go through their domain, which is outside of what you defined in your allowed_domains, they are filtered as offsite requests. To avoid this issue you can remove this line entirely:

allowed_domains = ['glassdoor.co.uk'] 

or try including 'api.scraperapi.com' in it.
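For example, a minimal sketch of that second option (keeping your original domain so the Glassdoor pages themselves also pass the filter) would look something like this:

allowed_domains = ['glassdoor.co.uk', 'api.scraperapi.com']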

Upvotes: 1
