Reputation: 22440
I've written a tiny scraper in Python Scrapy to parse names from a webpage. The page paginates through four more pages. There are 46 names in total across the pages, but the scraper is collecting only 36.
By default the scraper would skip the content of the first landing page, but I've tried to handle that with parse_start_url.
However, the problem I'm facing at the moment is that the scraper surprisingly skips the content of the second page and parses all the rest: the first page, the third page, the fourth page and so on. Why is this happening, and how can I deal with it? Thanks in advance.
Here is the script I'm working with:
import scrapy

class DataokSpider(scrapy.Spider):
    name = "dataoksp"
    start_urls = ["https://data.ok.gov/browse?page=1&f[0]=bundle_name%3ADataset&f[1]=im_field_categories%3A4191"]

    def parse(self, response):
        for link in response.css('.pagination .pager-item a'):
            new_link = link.css("::attr(href)").extract_first()
            yield scrapy.Request(url=response.urljoin(new_link), callback=self.target_page)

    def target_page(self, response):
        parse_start_url = self.target_page  # I used this to capture the content of the first page
        for titles in response.css('.title a'):
            name = titles.css("::text").extract_first()
            yield {'Name': name}
Upvotes: 3
Views: 390
Reputation: 22440
The solution turns out to be very easy. I've already fixed it.
import scrapy

class DataokSpider(scrapy.Spider):
    name = "dataoksp"
    start_urls = ["https://data.ok.gov/browse?f[0]=bundle_name%3ADataset&f[1]=im_field_categories%3A4191"]

    def parse(self, response):
        for f_link in self.start_urls:
            yield response.follow(url=f_link, callback=self.target_page)  # this is the line which fixes the issue
        for link in response.css('.pagination .pager-item a'):
            new_link = link.css("::attr(href)").extract_first()
            yield response.follow(url=new_link, callback=self.target_page)

    def target_page(self, response):
        for titles in response.css('.title a'):
            name = titles.css("::text").extract_first()
            yield {'Name': name}
Now it gives me all the results.
Upvotes: 1
Reputation: 4021
Because the link you are specifying in start_urls is actually the link of the second page. If you open it, you'll see there's no <a> tag for the current page, which is why page 2 never reaches target_page. You should therefore point start_urls to:
https://data.ok.gov/browse?f[0]=bundle_name%3ADataset&f[1]=im_field_categories%3A4191
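To see why the original start URL points at the second page, you can inspect its query string with the standard library (a quick illustration, not part of the spider; per the explanation above, the site's pager numbers pages from 0, so page=1 is the second results page):

```python
from urllib.parse import urlsplit, parse_qs

# The start URL from the question
url = "https://data.ok.gov/browse?page=1&f[0]=bundle_name%3ADataset&f[1]=im_field_categories%3A4191"
params = parse_qs(urlsplit(url).query)

print(params["page"])  # ['1'] -> the pager's second page, since paging starts at 0
print(params["f[0]"])  # ['bundle_name:Dataset'] (%3A is decoded to ':')
```

Dropping the page parameter entirely, as in the URL above, gives you the first page of results.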
This code should help you:
import scrapy
from scrapy.http import Request

class DataokspiderSpider(scrapy.Spider):
    name = 'dataoksp'
    allowed_domains = ['data.ok.gov']
    start_urls = ["https://data.ok.gov/browse?f[0]=bundle_name%3ADataset&f[1]=im_field_categories%3A4191"]

    def parse(self, response):
        for titles in response.css('.title a'):
            name = titles.css("::text").extract_first()
            yield {'Name': name}
        next_page = response.xpath('//li[@class="pager-next"]/a/@href').extract_first()
        if next_page:
            yield Request("https://data.ok.gov{}".format(next_page), callback=self.parse)
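As a side note (my suggestion, not part of the answer's code): rather than hardcoding the domain when building the next-page request, you can join the relative href against the base URL. The standard library's urljoin shows the idea (the next_page value below is a hypothetical example href):

```python
from urllib.parse import urljoin

base = "https://data.ok.gov/browse?f[0]=bundle_name%3ADataset&f[1]=im_field_categories%3A4191"
next_page = "/browse?page=1&f[0]=bundle_name%3ADataset&f[1]=im_field_categories%3A4191"  # example href

# Joining a root-relative href against the base keeps the scheme and domain
print(urljoin(base, next_page))
# https://data.ok.gov/browse?page=1&f[0]=bundle_name%3ADataset&f[1]=im_field_categories%3A4191
```

In Scrapy itself, yield response.follow(next_page, callback=self.parse) accepts a relative URL and performs this join for you.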
Stats (see item_scraped_count):
{
    'downloader/request_bytes': 2094,
    'downloader/request_count': 6,
    'downloader/request_method_count/GET': 6,
    'downloader/response_bytes': 45666,
    'downloader/response_count': 6,
    'downloader/response_status_count/200': 6,
    'finish_reason': 'finished',
    'finish_time': datetime.datetime(2017, 9, 19, 7, 23, 47, 801934),
    'item_scraped_count': 46,
    'log_count/DEBUG': 53,
    'log_count/INFO': 7,
    'memusage/max': 47509504,
    'memusage/startup': 47509504,
    'request_depth_max': 4,
    'response_received_count': 6,
    'scheduler/dequeued': 5,
    'scheduler/dequeued/memory': 5,
    'scheduler/enqueued': 5,
    'scheduler/enqueued/memory': 5,
    'start_time': datetime.datetime(2017, 9, 19, 7, 23, 46, 59360)
}
Upvotes: 0