Reputation: 22440
I've written a script in Python using scrapy to parse some information from a webpage. The data on that webpage is spread across several pages through pagination. If I go with response.follow(), I can get it done. However, I would like to follow within scrapy the same logic I implemented with requests and BeautifulSoup, but I can't figure out how.
Using requests along with BeautifulSoup, I came up with this, which does just fine:
import requests
from bs4 import BeautifulSoup

page = 0
URL = 'http://esencjablog.pl/page/{}/'

while True:
    page += 1
    res = requests.get(URL.format(page))
    soup = BeautifulSoup(res.text, 'lxml')
    items = soup.select('.post_more a.qbutton')
    if len(items) <= 1:  # stop paginating once a page stops yielding items
        break
    for a in items:
        print(a.get("href"))
I would like to do the same using scrapy, following the logic I applied above, but every time I try, I end up with something like this:
import scrapy


class PaginationTestSpider(scrapy.Spider):
    name = 'pagination'
    # 63 is used here because the highest page number is 62
    start_urls = ['http://esencjablog.pl/page/{}/'.format(page) for page in range(1, 63)]

    def parse(self, response):
        for link in response.css('.post_more a.qbutton'):
            yield {"link": link.css('::attr(href)').extract_first()}
Once again, my question is: if I wish to do in scrapy what I already tried with requests and BeautifulSoup, how should the spider be structured when the last page number is unknown?
Upvotes: 0
Views: 785
Reputation: 146510
In that case you can't take advantage of parallel downloads, but since you want to simulate the same thing in Scrapy, it can be achieved in a few different ways.
Approach 1 - Yield pages using page numbers
import scrapy
from scrapy import Request


class PaginationTestSpider(scrapy.Spider):
    name = 'pagination'
    # Start with page #1
    start_urls = ['http://esencjablog.pl/page/1/']

    def parse(self, response):
        # we communicate the page number using request meta;
        # this is not mandatory, as we could extract the same data
        # from response.url as well, but I prefer using meta here
        page_no = response.meta.get('page', 1) + 1

        items = response.css('.post_more a.qbutton')
        for link in items:
            yield {"link": link.css('::attr(href)').extract_first()}

        if items:
            # if items were found, we move to the next page
            yield Request("http://esencjablog.pl/page/{}".format(page_no),
                          meta={"page": page_no}, callback=self.parse)
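As the comment in parse() notes, carrying the page number in meta is optional; it can just as well be derived from response.url. A minimal sketch of that alternative, assuming this site's /page/<n>/ URL pattern:

import re

# inside parse(): derive the next page number from response.url
# instead of from response.meta (assumes /page/<n>/ style URLs)
match = re.search(r"/page/(\d+)", response.url)
page_no = (int(match.group(1)) if match else 1) + 1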
Ideally, if you can find the last page count from the first request, you would extract that number and fire all the requests in the first parse call. But that only works when it is possible to know the last page number (see Approach 3 below).
Approach 2 - Yield next page using the next-page link
import scrapy


class PaginationTestSpider(scrapy.Spider):
    name = 'pagination'
    # Start with page #1
    start_urls = ['http://esencjablog.pl/page/1/']

    def parse(self, response):
        items = response.css('.post_more a.qbutton')
        for link in items:
            yield {"link": link.css('::attr(href)').extract_first()}

        next_page = response.xpath('//li[contains(@class, "next_last")]/a/@href').extract_first()
        if next_page:
            yield response.follow(next_page)  # follow to next page, and parse again
This is nothing but a blunt copy of what @Konstantin mentioned - sorry, but I wanted to make this a more complete answer.
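For context, response.follow(next_page) is roughly shorthand for resolving the (possibly relative) href against the current page's URL and yielding a new request yourself. A rough equivalent, assuming next_page is the href string extracted above:

# roughly what response.follow(next_page) does under the hood;
# with no explicit callback, the response goes back to parse()
url = response.urljoin(next_page)
yield scrapy.Request(url, callback=self.parse)

This is also why response.follow is convenient when a site uses relative links in its pagination.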
Approach 3 - Yield all pages on first response
import scrapy
from scrapy import Request


class PaginationTestSpider(scrapy.Spider):
    name = 'pagination'
    # Start with page #1
    start_urls = ['http://esencjablog.pl/page/1/']
    first_request = True

    def parse(self, response):
        if self.first_request:
            self.first_request = False
            # the "last page" arrow link (fa-angle-double-right class)
            # carries the highest page number in its href
            last_page_num = int(response.css(
                ".fa-angle-double-right::attr(href)").re_first(r"(\d+)/?$"))

            # yield all the pages on the first request so we take
            # advantage of parallel downloads
            for page_no in range(2, last_page_num + 1):
                yield Request("http://esencjablog.pl/page/{}".format(page_no),
                              callback=self.parse)

        items = response.css('.post_more a.qbutton')
        for link in items:
            yield {"link": link.css('::attr(href)').extract_first()}
The best part of this approach is that you fetch the first page, read the last page count from it, and then yield all the remaining pages at once, so the downloads happen simultaneously. The first two approaches are more sequential in nature; you would only follow them if you don't want to load the site much at all. The ideal approach for a scraper is Approach 3.
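How parallel those simultaneous downloads actually are is governed by Scrapy's concurrency settings; a small sketch of the relevant knobs (the values below are illustrative, not recommendations):

# settings.py -- illustrative values, tune them for the target site
CONCURRENT_REQUESTS = 16            # global cap on in-flight requests
CONCURRENT_REQUESTS_PER_DOMAIN = 8  # per-domain cap
DOWNLOAD_DELAY = 0.25               # seconds between requests, to be polite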
Now, regarding the use of the meta object: it is well explained in the Scrapy documentation on passing additional data to callback functions. I'm adding the relevant part here for reference:
The callback of a request is a function that will be called when the response of that request is downloaded. The callback function will be called with the downloaded Response object as its first argument.
Example:
def parse_page1(self, response):
    return scrapy.Request("http://www.example.com/some_page.html",
                          callback=self.parse_page2)

def parse_page2(self, response):
    # this would log http://www.example.com/some_page.html
    self.logger.info("Visited %s", response.url)
In some cases you may be interested in passing arguments to those callback functions so you can receive the arguments later, in the second callback. You can use the Request.meta attribute for that.
Here’s an example of how to pass an item using this mechanism, to populate different fields from different pages:
def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    request = scrapy.Request("http://www.example.com/some_page.html",
                             callback=self.parse_page2)
    request.meta['item'] = item
    yield request

def parse_page2(self, response):
    item = response.meta['item']
    item['other_url'] = response.url
    yield item
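As a side note, newer Scrapy versions (1.7+) also provide Request.cb_kwargs, which passes data to the callback as plain keyword arguments instead of going through meta; a minimal sketch of the same item-passing idea:

def parse_page1(self, response):
    # cb_kwargs entries (Scrapy >= 1.7) arrive as keyword
    # arguments of the callback
    yield scrapy.Request("http://www.example.com/some_page.html",
                         callback=self.parse_page2,
                         cb_kwargs={"main_url": response.url})

def parse_page2(self, response, main_url):
    yield {"main_url": main_url, "other_url": response.url}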
Upvotes: 4
Reputation: 547
You can iterate through the pages like this, as shown in the Scrapy docs:
import scrapy


class PaginationTestSpider(scrapy.Spider):
    name = 'pagination'
    start_urls = ['http://esencjablog.pl/page/1/']  # go to the first page

    def parse(self, response):
        for link in response.css('.post_more a.qbutton'):
            yield {"link": link.css('::attr(href)').extract_first()}

        next_page = response.xpath('//li[contains(@class, "next_last")]/a/@href').extract_first()
        if next_page:
            yield response.follow(next_page)  # follow to next page, and parse again
Upvotes: 0
Reputation: 1981
You have to use scrapy.Request for that:
import scrapy


class PaginationTestSpider(scrapy.Spider):
    name = 'pagination'
    start_urls = ['http://esencjablog.pl/page/58']

    def parse(self, response):
        # Find the href of the link to follow
        link = response.css('.post_more a.qbutton::attr(href)')
        if link:
            # Extract the href; extract_first is enough because you only need one
            href = link.extract_first()
            # just in case the website uses relative hrefs
            url = response.urljoin(href)
            # You may change the callback if you want to use a different method
            yield scrapy.Request(url, callback=self.parse)
You can find more details in the Scrapy documentation.
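If you want a quick way to try any of the spiders above without creating a full project, here is a sketch using CrawlerProcess (the FEEDS setting assumes Scrapy 2.1+; older versions use FEED_URI/FEED_FORMAT instead):

from scrapy.crawler import CrawlerProcess

# run the spider from a plain Python script and export the
# yielded items to links.json
process = CrawlerProcess(settings={
    "FEEDS": {"links.json": {"format": "json"}},
})
process.crawl(PaginationTestSpider)
process.start()  # blocks until the crawl finishes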
Upvotes: 0