Reputation: 755
I am trying to parse a domain whose contents are as follows:
Page 1 - contains links to 10 articles
Page 2 - contains links to 10 articles
Page 3 - contains links to 10 articles
and so on...
My job is to parse all the articles on all pages.
My approach: parse all the pages, store the links to all the articles in a list, and then iterate over the list and parse each link.
So far I have been able to iterate through the pages, parse them, and collect links to the articles. I am stuck on how to start parsing this list.
My code so far:
import scrapy

class DhoniSpider(scrapy.Spider):
    name = "test"
    start_urls = [
        "https://www.news18.com/cricketnext/newstopics/ms-dhoni.html"
    ]
    count = 0

    def __init__(self, *a, **kw):
        super(DhoniSpider, self).__init__(*a, **kw)
        self.headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0'}
        self.seed_urls = []

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, headers=self.headers, callback=self.parse)

    def parse(self, response):
        DhoniSpider.count += 1
        if DhoniSpider.count > 2:
            # there are many pages, this is just to stop parsing after 2 pages
            return
        for ul in response.css('div.t_newswrap'):
            ref_links = ul.css('div.t_videos_box a.t_videosimg::attr(href)').getall()
            self.seed_urls.extend(ref_links)
        next_page = response.css('ul.pagination li a.nxt::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, headers=self.headers, callback=self.parse)

    def iterate_urls(self):
        for link in self.seed_urls:
            link = response.urljoin(link)
            yield scrapy.Request(link, headers=self.headers, callback=self.parse_page)

    def parse_page(self, response):
        print("called")
How do I iterate over my self.seed_urls list and parse the links? From where should I call my iterate_urls function?
Upvotes: 2
Views: 583
Reputation: 3561
Usually in these cases there is no need for an external function like your iterate_urls:
def parse(self, response):
    DhoniSpider.count += 1
    if DhoniSpider.count > 2:
        # there are many pages, this is just to stop parsing after 2 pages
        return
    for ul in response.css('div.t_newswrap'):
        for ref_link in ul.css('div.t_videos_box a.t_videosimg::attr(href)').getall():
            yield scrapy.Request(response.urljoin(ref_link), headers=self.headers, callback=self.parse_page, priority=5)
    next_page = response.css('ul.pagination li a.nxt::attr(href)').get()
    if next_page is not None:
        next_page = response.urljoin(next_page)
        yield scrapy.Request(next_page, headers=self.headers, callback=self.parse)

def parse_page(self, response):
    print("called")
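Once parse_page is called for each article, it can extract and yield the actual data instead of just printing. A minimal sketch, assuming placeholder selectors (h1 for the title, article p for the body) that would need to be checked against the real news18.com article markup:
def parse_page(self, response):
    # NOTE: the selectors below are assumptions for illustration only;
    # inspect the actual article pages to find the correct ones.
    yield {
        'url': response.url,
        'title': response.css('h1::text').get(),
        'body': ' '.join(response.css('article p::text').getall()),
    }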
Upvotes: 1
Reputation: 2224
You don't have to collect the links into a list; you can just yield a scrapy.Request right after you parse them. So instead of self.seed_urls.extend(ref_links), you can modify the function as follows:
def iterate_urls(self, response, seed_urls):
    # response is passed in so urljoin can resolve relative links
    for link in seed_urls:
        link = response.urljoin(link)
        yield scrapy.Request(link, headers=self.headers, callback=self.parse_page)
and call it:
...
for ul in response.css('div.t_newswrap'):
    ref_links = ul.css('div.t_videos_box a.t_videosimg::attr(href)').getall()
    yield from self.iterate_urls(response, ref_links)
...
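Note that yield from is what forwards each scrapy.Request produced by the helper out of parse; yielding the generator object itself would hand Scrapy something it cannot schedule. A minimal plain-Python illustration of the difference (the function names here are made up for the example):
def helper(values):
    for v in values:
        yield v * 2

def with_yield():
    yield helper([1, 2, 3])        # yields one generator object

def with_yield_from():
    yield from helper([1, 2, 3])   # yields 2, 4, 6 one at a time

print(list(with_yield()))       # [<generator object helper at ...>]
print(list(with_yield_from()))  # [2, 4, 6]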
Upvotes: 0