Katherine Cavanaugh

Reputation: 85

XPath selector not working in Scrapy

I'm trying to extract the text matched by this XPath:

//*/li[contains(., "Full Name")]/span/text()

from this webpage: http://votesmart.org/candidate/biography/56110/norma-smith#.V9SwdZMrKRs

I've tested it in Google Chrome's console, where it works, along with many other variations of the XPath, but I can't get it to work in Scrapy. My code only returns "{}".

Here's where I have been testing it in my code, for context:

def parse_bio(self, response):
    loader = response.meta['loader']
    fullnameValue = response.xpath('//*/li[contains(., "Full Name")]/span/text()').extract()
    loader.add_value('fullName', fullnameValue)
    return loader.load_item()

I don't think the problem is in my code, since it works fine with other (very broad) XPath selectors, but I'm not sure what's wrong with this XPath. I have JavaScript disabled, if that makes a difference. Any help would be great!

Edit: Here is the rest of the code, to make things clearer:

from scrapy import Spider, Request, Selector
from votesmart.items import LegislatorsItems, TheLoader


class VSSpider(Spider):
    name = "vs"
    allowed_domains = ["votesmart.org"]
    start_urls = ["https://votesmart.org/officials/WA/L/washington-state-legislative"]

    def parse(self, response):
        # follow every legislator link on the listing page
        for href in response.xpath('//h5/a/@href').extract():
            person_url = response.urljoin(href)
            yield Request(person_url, callback=self.candidatesPoliticalSummary)

    def candidatesPoliticalSummary(self, response):
        item = LegislatorsItems()
        l = TheLoader(item=LegislatorsItems(), response=response)

        ...
        # populating items with item loader. works fine

        # create right bio url and pass item loader to it
        bio_url = response.url.replace('votesmart.org/candidate/',
                                       'votesmart.org/candidate/biography/')
        return Request(bio_url, callback=self.parse_bio, meta={'loader': l})

    def parse_bio(self, response):
        loader = response.meta['loader']
        print(response.request.url)
        loader.add_xpath('fullName', '//*/li[contains(., "Full Name")]/span/text()')
        return loader.load_item()

Upvotes: 2

Views: 3071

Answers (2)

Katherine Cavanaugh

Reputation: 85

I figured out my problem! Many pages on the site were login-protected, so I couldn't scrape pages that I wasn't able to access in the first place. Scrapy's FormRequest did the trick. Thanks for all the help (especially the suggestion of using view(response), which is super helpful).

Upvotes: 1

alecxe

Reputation: 473763

The expression works for me in the Scrapy shell as is:

$ scrapy shell "http://votesmart.org/candidate/biography/56110/norma-smith#.V9SwdZMrKRs"
In [1]: response.xpath('//*/li[contains(., "Full Name")]/span/text()').extract()
Out[1]: [u'Norma Smith']

Try using the add_xpath() method instead:

loader.add_xpath('fullName', '//*/li[contains(., "Full Name")]/span/text()')

Upvotes: 0
