Reputation: 129
I'm trying to implement a custom Scraper API setup, but as a beginner I think I'm doing something wrong, even though I followed their documentation to set everything up. Here is a documentation link
from scrapy import Spider
from scrapy.http import Request
from .config import API
from scraper_api import ScraperAPIClient

client = ScraperAPIClient(API)


class GlassSpider(Spider):
    name = 'glass'
    allowed_domains = ['glassdoor.co.uk']
    start_urls = [client.scrapyGet(url='https://www.glassdoor.co.uk/Job/russian-jobs-SRCH_KE0,7.htm?fromAge=1')]

    def parse(self, response):
        jobs = response.xpath('//*[contains(@class, "react-job-listing")]')
        for job in jobs:
            job_url = job.xpath('.//*[contains(@class, "jobInfoItem jobTitle")]/@href').extract_first()
            absulate_job_url = response.urljoin(job_url)
            yield Request(client.scrapyGet(url=absulate_job_url),
                          callback=self.parse_jobpage,
                          meta={
                              "Job URL": absulate_job_url
                          })

    def parse_jobpage(self, response):
        absulate_job_url = response.meta.get('Job URL')
        job_description = "".join(line for line in response.xpath('//*[contains(@class, "desc")]//text()').extract())
        yield {
            "Job URL": absulate_job_url,
            "Job Description": job_description
        }
This is the output I'm getting. What is wrong with my code? Please help me fix it so I can follow along and understand the problem. Thank you.
2020-10-01 23:01:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://api.scraperapi.com/?url=https%3A%2F%2Fwww.glassdoor.co.uk%2FJob%2Frussian-jobs-SRCH_KE0%2C7.htm%3FfromAge%3D1&api_key=bec9dd9f2be095dfc6158a7e609&scraper_sdk=python> (referer: None)
2020-10-01 23:01:45 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'api.scraperapi.com': <GET https://api.scraperapi.com/?url=https%3A%2F%2Fapi.scraperapi.com%2Fpartner%2FjobListing.htm%3Fpos%3D101%26ao%3D1044074%26s%3D149%26guid%3D00000174e51ccd8988e2e5420e67cf0d%26src%3DGD_JOB_AD%26t%3DSRFJ%26vt%3Dw%26cs%3D1_94f59ee8%26cb%3D1601571704401%26jobListingId%3D3696480795&api_key=bec9d9f82b0955c615c8a7e639scraper_sdk=python>
Upvotes: 0
Views: 166
Reputation: 2564
I'm not familiar with this particular library, but your execution logs show that the request is being filtered because it is considered offsite.
[scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'api.scraperapi.com': <GET https://api.scraperapi.com/?url=https%3A%2F%2Fapi.scraperapi.com%2Fpartner%2FjobListing.htm%3Fpos%3D101%26ao%3D1044074%26s%3D149%26guid%3D00000174e51ccd8988e2e5420e67cf0d%26src%3DGD_JOB_AD%26t%3DSRFJ%26vt%3Dw%26cs%3D1_94f59ee8%26cb%3D1601571704401%26jobListingId%3D3696480795&api_key=bec9d9f82b0955c615c8a7e639scraper_sdk=python>
Scraper API routes your request through its own domain (api.scraperapi.com), which is outside what you defined in allowed_domains, so the request is filtered as offsite. To avoid this, you can remove this line entirely:
    allowed_domains = ['glassdoor.co.uk']
or try adding 'api.scraperapi.com' to it.
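To illustrate why the request is dropped, here is a minimal sketch of a hostname check similar in spirit to what the offsite filter does. It is only an approximation: the is_offsite helper and the YOUR_KEY placeholder are made up for this example and are not part of Scrapy or the Scraper API SDK, and Scrapy's real middleware is implemented differently.

```python
from urllib.parse import urlparse, quote


def is_offsite(url, allowed_domains):
    """Hypothetical helper: True if the URL's host is neither one of
    allowed_domains nor a subdomain of one of them."""
    host = urlparse(url).hostname or ""
    return not any(host == d or host.endswith("." + d) for d in allowed_domains)


# The page you actually want to scrape.
target = "https://www.glassdoor.co.uk/Job/russian-jobs-SRCH_KE0,7.htm?fromAge=1"

# A proxied URL shaped like the one in the log: the target page is only a
# query parameter, so the host of the request is api.scraperapi.com.
proxied = "https://api.scraperapi.com/?url=" + quote(target, safe="") + "&api_key=YOUR_KEY"

print(is_offsite(target, ["glassdoor.co.uk"]))                          # False
print(is_offsite(proxied, ["glassdoor.co.uk"]))                         # True
print(is_offsite(proxied, ["glassdoor.co.uk", "api.scraperapi.com"]))   # False
```

The second call is the situation in your log: the request goes to api.scraperapi.com, which is not covered by ['glassdoor.co.uk']. Once that domain is added to the list (or the list is removed), the request is no longer treated as offsite.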
Upvotes: 1