Sahil Thapar

Reputation: 301

xpath not getting selected

I have just started using Scrapy. Here is an example of a page that I want to crawl:

http://www.thefreedictionary.com/shame

The code for my spider:

from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from dic_crawler.items import DicCrawlerItem

from urlBuilder import *   

class Dic_crawler(BaseSpider):
    name = "dic"
    allowed_domains = ["www.thefreedictionary.com"]
    start_urls = listmaker()[:]
    print start_urls

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//*[@id="MainTxt"]/table/tbody')
        print 'SITES:\n',sites


        item = DicCrawlerItem()

        item["meanings"] = sites.select('//*[@id="MainTxt"]/table/tbody/tr/td/div[1]/div[1]/div[1]/text()').extract()

        print item

        return item

listmaker() returns a list of URLs to scrape.

My problem is that the sites variable comes back empty if the XPath goes as far as 'tbody', whereas if I select only up to 'table' I get the part of the page I want.

As a result, I cannot retrieve the meaning of a word into item["meanings"], because nothing beyond tbody gets selected.

Also, while at it, the site gives multiple meanings for a word. I would like to extract all of them, but I only know how to extract a single one.

Thanks

Upvotes: 0

Views: 110

Answers (1)

paul trmbrth

Reputation: 20748

Here's a spider skeleton to get you started:

from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector

class Dic_crawler(BaseSpider):
    name = "thefreedictionary"
    allowed_domains = ["www.thefreedictionary.com"]
    start_urls = ['http://www.thefreedictionary.com/shame']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)

        # loop on each "noun" or "verb" or something... section
        for category in hxs.select('id("MainTxt")//div[@class="pseg"]'):

            # this is simply to get what's in the <i> tag
            category_name = u''.join(category.select('./i/text()').extract())
            self.log("category: %s" % category_name)

            # within each category, a term can have multiple definitions;
            # "category" comes from .select(), so it is itself a selector
            # and you can call .select() on it as well,
            # here with a relative XPath expression selecting all definitions
            for definition in category.select('div[@class="ds-list"]'):
                definition_text = u'\n'.join(
                    definition.select('.//text()').extract())
                self.log(" - definition: %s" % definition_text)
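
On the empty tbody result: browsers insert a <tbody> element when they build the DOM for a table, even if the HTML sent by the server does not contain one, so an XPath copied from the browser inspector can fail against the raw response. That is most likely what happens here, though I have not checked the page source. Below is a minimal sketch of the effect using only the standard library's ElementTree rather than Scrapy. The markup is a hypothetical, simplified stand-in for the real page, and the pseg / ds-list class names are taken from the spider above.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the page markup: as served,
# the table has no <tbody>; a browser would add one in its DOM.
RAW_HTML = """
<html><body>
  <div id="MainTxt">
    <table>
      <tr><td>
        <div class="pseg"><i>n.</i>
          <div class="ds-list">first meaning</div>
          <div class="ds-list">second meaning</div>
        </div>
      </td></tr>
    </table>
  </div>
</body></html>
"""

root = ET.fromstring(RAW_HTML)

# Path as copied from a browser inspector: includes tbody, matches nothing.
with_tbody = root.findall(".//*[@id='MainTxt']/table/tbody")

# Same path without tbody: matches the table that is really in the source.
without_tbody = root.findall(".//*[@id='MainTxt']/table")

# Iterating over every match, instead of indexing div[1],
# collects all the meanings rather than just the first one.
meanings = [
    "".join(d.itertext()).strip()
    for d in root.findall(".//div[@class='pseg']/div[@class='ds-list']")
]

print(len(with_tbody), len(without_tbody), meanings)
# prints: 0 1 ['first meaning', 'second meaning']
```

The same idea carries over to the spider: drop tbody from the expression (or use // to skip over it) and loop over the matched nodes to build the list you store in item["meanings"].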

Upvotes: 1
