scrapy: Remove some elements from an xpath selector

Question

I'm using Scrapy to crawl a site with some odd formatting conventions. I want all the text and sub-elements of a certain div, except for a few divs in the middle. Here is the relevant piece of the page:


    

     
    
        Sample Text

Demo: http://example.com/dfa/asfa/aasfa


        
            http://www.coolfiles.ro/download/kleo13.rar/1098750
http://www.ainecreator.com/files/0MKOGM6D/kleo13.rar_links

        
        

        
            
                
                    
                        80
                        1
                        2
                        3
                        4
                        5
                    
                
                (votes: 3)
            
        
        
            
                Related News:
            
            1
            2
            3
            4
            5

The final output should look like this:


    

     
    
        Sample Text

Demo: http://example.com/dfa/asfa/aasfa


        
            http://www.coolfiles.ro/download/kleo13.rar/1098750
http://www.ainecreator.com/files/0MKOGM6D/kleo13.rar_links

Here is my Scrapy code. Please suggest what to add to this script:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from isbullshit.items import IsBullshitItem


class IsBullshitSpider(CrawlSpider):
    """ General configuration of the Crawl Spider """
    name = 'isbullshitwp'
    start_urls = ['http://example.com/themes'] # urls from which the spider will start crawling
    rules = [Rule(SgmlLinkExtractor(allow=[r'page/\d+']), follow=True), 
        # r'page/\d+' : regular expression for http://example.com/page/X URLs
        Rule(SgmlLinkExtractor(allow=[r'\w+']), callback='parse_blogpost')]
        # r'\w+' matches any URL containing a word character; a stricter pattern
        # such as r'\d{4}/\d{2}/\w+' would match only http://example.com/YYYY/MM/title URLs

    def parse_blogpost(self, response):
        hxs = HtmlXPathSelector(response)
        item = IsBullshitItem()
        item['title'] = hxs.select('//span[@class="storytitle"]/text()').extract()[0]
        item['article_html'] = hxs.select("//div[@class='article']").extract()[0]

        return item

Here are the XPath expressions I experimented with; none of them gave the desired result:

item['article_html'] = hxs.select("//div[@class='article']").extract()[0]
item['article_html'] = hxs.select("//div[@class='article']/following::node() [not(preceding::div[@class='reln']) and not(@class='reln')]").extract()[0]
item['article_html'] = hxs.select("//div[@class='article']/div[@class='reln']/preceding-sibling::node()[preceding-sibling::div[@class='quote']]").extract()[0]
item['article_html'] = hxs.select("//div[@class='article']/following::node() [not(preceding::div[@class='reln'])]").extract()[0]
item['article_html'] = hxs.select("//div[@class='article']/div[@class='quote']/*[not(self::div[@class='reln'])]").extract()[0]
item['article_html'] = hxs.select("//div[@class='article']/*[(self::name()='reln'])]").extract()[0]

Thanks in advance...
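
For reference, one way to get the output described above is to prune the unwanted child divs from the extracted article and re-serialize it, rather than trying to express the exclusion in a single XPath. The following is only a minimal sketch under stated assumptions: the markup is a hypothetical, well-formed stand-in for the real page, the class names `article`, `quote` and `reln` are taken from the XPath attempts above, the `rating` class is a guess, and `prune_article` is an illustrative helper name. It uses the standard library's `xml.etree.ElementTree`, which only accepts well-formed markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the page. Only the class names
# 'article', 'quote' and 'reln' appear in the question; 'rating' is assumed.
SAMPLE = """
<div class="article">
  <p>Sample Text</p>
  <div class="quote">http://www.coolfiles.ro/download/kleo13.rar/1098750</div>
  <div class="rating">(votes: 3)</div>
  <div class="reln"><strong>Related News:</strong></div>
</div>
"""

# Classes of the direct child divs that should be dropped (assumed names).
UNWANTED = {"rating", "reln"}


def prune_article(markup, unwanted=UNWANTED):
    """Return the article markup without direct child divs of unwanted classes."""
    article = ET.fromstring(markup)
    # Iterate over a copy of the children because the parent is mutated.
    for child in list(article):
        if child.tag == "div" and child.get("class") in unwanted:
            # remove() also discards the child's tail text (here: whitespace).
            article.remove(child)
    return ET.tostring(article, encoding="unicode")


cleaned = prune_article(SAMPLE)
print(cleaned)
```

In the spider, the same idea could be applied to the string returned by `hxs.select("//div[@class='article']").extract()[0]` before assigning it to `item['article_html']`. Real pages are rarely well-formed XML, so an HTML-tolerant parser is probably the safer choice there, for example `lxml.html.fromstring` with `drop_tree()` on each unwanted element, which also keeps any tail text.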
