Reputation: 33
Okay, well, I know why: nothing is being extracted for the next_page variable, but I'm not sure if I'm using XPath correctly.
import scrapy
from scrapy.selector import HtmlXPathSelector
from scrapy.http.request import Request


class SunBizSpider(scrapy.Spider):
    name = 'sunbiz'
    start_urls = ['http://search.sunbiz.org/Inquiry/CorporationSearch/SearchResults?inquiryType=EntityName&searchNameOrder=A&searchTerm=a']

    def parse(self, response):
        for href in response.css('.large-width a::attr(href)'):
            full_url = response.urljoin(href.extract())
            yield scrapy.Request(full_url, callback=self.parse_question)

    def parse_question(self, response):
        re1 = '((?:[0]?[1-9]|[1][012])[-:\\/.](?:(?:[0-2]?\\d{1})|(?:[3][01]{1}))[-:\\/.](?:(?:[1]{1}\\d{1}\\d{1}\\d{1})|(?:[2]{1}\\d{3})))(?![\\d])'  # MMDDYYYY 1
        hxs = HtmlXPathSelector(response)
        date = response.xpath('//span').re_first(re1)
        next_page = hxs.select("//div[@class='navigationBar']/@href").extract()
        yield {
            'Name': response.css('.corporationName span::text').extract()[1],
            'Date': date,
            'Link': response.url,
        }
        if next_page:
            yield scrapy.Request(next_page[1], callback=self.parse_question)
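For reference, a quick way to see what that XPath actually returns is the Scrapy shell; a minimal sketch, using the same start URL as above:

    scrapy shell "http://search.sunbiz.org/Inquiry/CorporationSearch/SearchResults?inquiryType=EntityName&searchNameOrder=A&searchTerm=a"
    >>> response.xpath("//div[@class='navigationBar']/@href").extract()
    []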
Upvotes: 0
Views: 174
Reputation: 18799
First, you don't need HtmlXPathSelector if you are already using response as a selector. response can handle both CSS and XPath, so don't worry about that.
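For instance, both of these select the same nodes straight off the response, with no extra selector object; a small sketch that assumes a div with class navigationBar containing links, as on your pages:

    by_css = response.css("div.navigationBar a::attr(href)").extract()
    by_xpath = response.xpath("//div[@class='navigationBar']//a/@href").extract()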
Second, you are trying to get a link with the XPath "//div[@class='navigationBar']/@href", which says: get the href attribute from a div. You should agree that is incorrect; href attributes come on <a> tags, so in this case the XPath you should use is:
"//div[@class='navigationBar'][1]//a[@title='Next On List']/@href"
Upvotes: 1