Reputation: 45
I am hoping my request is simple and straightforward for the more experienced Scrapy users out there.
In essence, the following code works well for scraping a second page based on a link found on the first page. I would like to extend it to scrape a third page, via a link on the second page. In the code below, parse_items handles the landing page (1st level), which contains 50 listings, and the spider is set up to follow each of the 50 links. parse_listing_page specifies which items to scrape from each "listing page" (2nd level). Within each listing page, I would like my script to follow a link through to another page and scrape an item or two there before returning to the "listing page" and then to the landing page.
The code below works well for recursive scraping at 2 levels. How could I expand this to 3?
from scrapy import log
from scrapy.log import ScrapyFileLogObserver
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from firstproject.items import exampleItem
from scrapy.http import Request
import urlparse

logfile_info = open('example_INFOlog.txt', 'a')
logfile_error = open('example_ERRlog.txt', 'a')
log_observer_info = log.ScrapyFileLogObserver(logfile_info, level=log.INFO)
log_observer_error = log.ScrapyFileLogObserver(logfile_error, level=log.ERROR)
log_observer_info.start()
log_observer_error.start()

class MySpider(CrawlSpider):
    name = "example"
    allowed_domains = ["example.com.au"]

    # Follow the pagination ("next") links on the landing pages.
    rules = (
        Rule(SgmlLinkExtractor(allow=("",), restrict_xpaths=('//li[@class="nextLink"]',)),
             callback="parse_items", follow=True),
    )

    def start_requests(self):
        start_urls = reversed([
            "http://www.example.com.au/1?new=true&list=10-to-100",
            "http://www.example.com.au/2?new=true&list=10-to-100",
            "http://www.example.com.au/2?new=true&list=100-to-200",
        ])
        return [Request(url=start_url) for start_url in start_urls]

    def parse_start_url(self, response):
        return self.parse_items(response)

    # 1st level: the landing page with the 50 listings.
    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        listings = hxs.select("//h2")
        for listing in listings:
            item = exampleItem()
            item["title"] = listing.select("a/text()").extract()[0]
            item["link"] = listing.select("a/@href").extract()[0]
            url = "http://example.com.au%s" % item["link"]
            # Hand the partially-filled item to the 2nd level via meta.
            yield Request(url=url, meta={'item': item}, callback=self.parse_listing_page)

    # 2nd level: the individual listing page.
    def parse_listing_page(self, response):
        hxs = HtmlXPathSelector(response)
        item = response.meta['item']
        item["item_1"] = hxs.select('#censored Xpath').extract()
        item["item_2"] = hxs.select('#censored Xpath').extract()
        item["item_3"] = hxs.select('#censored Xpath').extract()
        item["item_4"] = hxs.select('#censored Xpath').extract()
        return item
Many thanks
Upvotes: 1
Views: 619
Reputation: 45
Here is my updated code. The code below is able to pull the counter_link in an appropriate format (tested), but it seems the else branch is always taken, so the Request with the parse_listing_counter callback is never yielded. If I remove the if and else clauses and force the code to call back to parse_listing_counter, it doesn't yield any items (not even those from parse_items or the listing page).
What have I done wrong in my code? I've also checked the XPaths - all seem ok.
from scrapy import log
from scrapy.log import ScrapyFileLogObserver
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from firstproject.items import exampleItem
from scrapy.http import Request
import urlparse

logfile_info = open('example_INFOlog.txt', 'a')
logfile_error = open('example_ERRlog.txt', 'a')
log_observer_info = log.ScrapyFileLogObserver(logfile_info, level=log.INFO)
log_observer_error = log.ScrapyFileLogObserver(logfile_error, level=log.ERROR)
log_observer_info.start()
log_observer_error.start()

class MySpider(CrawlSpider):
    name = "example"
    allowed_domains = ["example.com.au"]

    rules = (
        Rule(SgmlLinkExtractor(allow=("",), restrict_xpaths=('//li[@class="nextLink"]',)),
             callback="parse_items", follow=True),
    )

    def start_requests(self):
        start_urls = reversed([
            "http://www.example.com.au/1?new=true&list=10-to-100",
            "http://www.example.com.au/2?new=true&list=10-to-100",
            "http://www.example.com.au/2?new=true&list=100-to-200",
        ])
        return [Request(url=start_url) for start_url in start_urls]

    def parse_start_url(self, response):
        return self.parse_items(response)

    # 1st level: the landing page with the 50 listings.
    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        listings = hxs.select("//h2")
        for listing in listings:
            item = exampleItem()
            item["title"] = listing.select("a/text()").extract()[0]
            item["link"] = listing.select("a/@href").extract()[0]
            url = "http://example.com.au%s" % item["link"]
            yield Request(url=url, meta={'item': item}, callback=self.parse_listing_page)

    # 2nd level: the individual listing page.
    def parse_listing_page(self, response):
        hxs = HtmlXPathSelector(response)
        item = response.meta['item']
        item["item_1"] = hxs.select('#censored Xpath').extract()
        item["item_2"] = hxs.select('#censored Xpath').extract()
        item["item_3"] = hxs.select('#censored Xpath').extract()
        item["item_4"] = hxs.select('#censored Xpath').extract()
        item["counter_link"] = hxs.select('#censored Xpath').extract()[0]
        counter_link = response.meta.get('counter_link', None)
        if counter_link:
            url2 = "http://example.com.au%s" % item["counter_link"]
            yield Request(url=url2, meta={'item': item}, callback=self.parse_listing_counter)
        else:
            yield item

    # 3rd level: the page behind counter_link.
    def parse_listing_counter(self, response):
        hxs = HtmlXPathSelector(response)
        item = response.meta['item']
        item["counter"] = hxs.select('#censored Xpath').extract()
        return item
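For debugging, the if test can be instrumented with the log API that is already imported at the top of the spider. A rough sketch (XPath censored as above) that records both the freshly extracted value and the keys actually present in response.meta:

    def parse_listing_page(self, response):
        hxs = HtmlXPathSelector(response)
        item = response.meta['item']
        item["counter_link"] = hxs.select('#censored Xpath').extract()[0]
        # Log the value just extracted and the keys that arrived via meta,
        # to see which of the two the if test is really checking.
        log.msg("extracted counter_link: %r" % item["counter_link"], level=log.INFO)
        log.msg("meta keys: %r" % response.meta.keys(), level=log.INFO)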
Upvotes: 1
Reputation: 12092
This is how the flow of your code works.
The Rule in the MySpider class is applied first, with its callback set to parse_items. At the end of parse_items there is a yield of a Request whose callback is parse_listing_page, and that is what takes the crawl to the second level. If you want to recurse to a third level from parse_listing_page, there has to be a Request yielded from parse_listing_page in the same way, with a callback for the third-level page.
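A sketch of that third-level yield, adapted from the updated code in the question (the XPaths stay censored; the field names are the ones from the question). One detail worth noting: parse_items only puts 'item' into meta, so response.meta.get('counter_link') always returns None in parse_listing_page, which would explain why the else branch runs; testing the freshly extracted value instead avoids that.

    # 2nd level: scrape the listing fields, then chain to the 3rd level.
    def parse_listing_page(self, response):
        hxs = HtmlXPathSelector(response)
        item = response.meta['item']
        item["item_1"] = hxs.select('#censored Xpath').extract()
        # ... item_2 to item_4 as before ...
        counter_link = hxs.select('#censored Xpath').extract()
        if counter_link:
            # A 3rd-level link exists: yield another Request and keep
            # passing the partially-filled item along in meta.
            item["counter_link"] = counter_link[0]
            url2 = "http://example.com.au%s" % item["counter_link"]
            yield Request(url=url2, meta={'item': item},
                          callback=self.parse_listing_counter)
        else:
            # No 3rd-level link: the item is complete at the 2nd level.
            yield item

    # 3rd level: the page behind counter_link completes the item.
    def parse_listing_counter(self, response):
        hxs = HtmlXPathSelector(response)
        item = response.meta['item']
        item["counter"] = hxs.select('#censored Xpath').extract()
        return item

The same meta={'item': item} chaining that carries the item from the 1st to the 2nd level carries it on to the 3rd, where parse_listing_counter finishes it off and returns it.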
Upvotes: 1