Reputation: 85
I'm trying to extract text using this XPath:
//*/li[contains(., "Full Name")]/span/text()
from this webpage: http://votesmart.org/candidate/biography/56110/norma-smith#.V9SwdZMrKRs
I've tested it in Google Chrome's console, where it works (as do many other variations of the XPath), but I can't get it to work with Scrapy. My code only returns "{}".
Here's where I have been testing it in my code, for context:
def parse_bio(self, response):
    loader = response.meta['loader']
    fullnameValue = response.xpath('//*/li[contains(., "Full Name")]/span/text()').extract()
    loader.add_value('fullName', fullnameValue)
    return loader.load_item()
I don't think the problem is my code, since it works fine with other (very broad) XPath selectors, but I'm not sure what's wrong with this XPath. I have JavaScript disabled, if that makes a difference. Any help would be great!
Edit: Here is the rest of the code to make it more clear:
from scrapy import Spider, Request, Selector
from votesmart.items import LegislatorsItems, TheLoader

class VSSpider(Spider):
    name = "vs"
    allowed_domains = ["votesmart.org"]
    start_urls = ["https://votesmart.org/officials/WA/L/washington-state-legislative"]

    def parse(self, response):
        # follow the link to each legislator's summary page
        for href in response.xpath('//h5/a/@href').extract():
            person_url = response.urljoin(href)
            yield Request(person_url, callback=self.candidatesPoliticalSummary)

    def candidatesPoliticalSummary(self, response):
        item = LegislatorsItems()
        l = TheLoader(item=LegislatorsItems(), response=response)
        ...
        #populating items with item loader. works fine
        # create right bio url and pass item loader to it
        bio_url = response.url.replace('votesmart.org/candidate/',
                                       'votesmart.org/candidate/biography/')
        return Request(bio_url, callback=self.parse_bio, meta={'loader': l})

    def parse_bio(self, response):
        loader = response.meta['loader']
        print(response.request.url)
        loader.add_xpath('fullName', '//*/li[contains(., "Full Name")]/span/text()')
        return loader.load_item()
Upvotes: 2
Views: 3071
Reputation: 85
I figured out my problem! Many pages on the site were login protected, so I couldn't scrape pages that I wasn't able to access in the first place. Scrapy's FormRequest did the trick. Thanks for all the help (especially the suggestion of using view(response), which is super helpful).
Upvotes: 1
Reputation: 473763
The expression is working for me in the shell perfectly as is:
$ scrapy shell "http://votesmart.org/candidate/biography/56110/norma-smith#.V9SwdZMrKRs"
In [1]: response.xpath('//*/li[contains(., "Full Name")]/span/text()').extract()
Out[1]: [u'Norma Smith']
Try using the add_xpath() method instead:
loader.add_xpath('fullName', '//*/li[contains(., "Full Name")]/span/text()')
Upvotes: 0