oeb

Reputation: 189

Extract all text in between two nodes using XPath for web scraping?

       <div class="jokeContent">
            <h2 style="color:#369;">Can I be Frank</h2>
            What did Ellen Degeneres say to Kathy Lee? 
           <p></p> <p>Can I be Frank with you? </p> 
           <p>Submitted by Calamjo</p> 
           <p>Edited by Curtis</p>      
       <div align="right" style="margin-top:10px;margin-bottom:10px;">#joke <a href="http://www.jokesoftheday.net/tag/short-jokes/">#short</a> </div>
       <div style="clear:both;"></div>
    </div>

So I am trying to extract all text after the </h2> and before the <div align="right" style=...> node. What I have tried so far:

    jokes = response.xpath('//div[@class="jokeContent"]')
    for joke in jokes:
        text = joke.xpath('text()[normalize-space()]').extract()
        if len(text) > 0:
            yield text

This works to some extent, but the website's HTML is inconsistent: sometimes the text is wrapped in <p>TEXT</p>, sometimes in <br>TEXT</br>, and sometimes it is just bare TEXT. So I thought it might make sense to extract everything after the header and before the styled div node, and do the filtering afterwards.

Upvotes: 0

Views: 969

Answers (1)

Granitosaurus

Reputation: 21446

If you are looking for a literal xpath of what you are describing, it could be something like:

    In [1]: sel.xpath("//h2/following-sibling::*[not(self::div) and not(preceding-sibling::div)]//text()").extract()
    Out[1]: [u'Can I be Frank with you? ', u'Submitted by Calamjo', u'Edited by Curtis']

But there's probably a more logical, cleaner solution:

    In [2]: sel.xpath("//h2/following-sibling::p//text()").extract()
    Out[2]: [u'Can I be Frank with you? ', u'Submitted by Calamjo', u'Edited by Curtis']

This just selects the paragraph tags. You said the text might be wrapped in other tags instead; you can match several different tags with the self::tag test:

    In [3]: sel.xpath("//h2/following-sibling::*[self::p or self::br]//text()").extract()
    Out[3]: [u'Can I be Frank with you? ', u'Submitted by Calamjo', u'Edited by Curtis']

Edit: apparently I missed the text directly under the div itself. This can be amended with the | (union) operator:

    In [4]: sel.xpath("//h2/../text()[normalize-space(.)] | //h2/../p//text()").extract()
    Out[4]:
    [u'\n            What did Ellen Degeneres say to Kathy Lee? \n           ',
     u'Can I be Frank with you? ',
     u'Submitted by Calamjo',
     u'Edited by Curtis']

normalize-space(.) is there only to drop text nodes that contain nothing but whitespace (e.g. ' \n').
You can union the first part of this XPath with any of the expressions above and you'd get similar results.
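
Note that the normalize-space(.) predicate only decides which text nodes are kept; it does not trim the strings that come back, which is why the first item in the last output still carries its newlines and indentation. For the "filter afterwards" step mentioned in the question, a minimal sketch in plain Python could look like the following. The helper name clean_texts is hypothetical, and the sample list is copied from the output above with one extra whitespace-only entry added as an assumption, just to show it being dropped.

```python
def clean_texts(raw_texts):
    """Collapse whitespace runs in each string and drop blank entries."""
    cleaned = []
    for raw in raw_texts:
        # str.split() with no arguments splits on any run of whitespace,
        # so re-joining with a single space trims and collapses in one step.
        collapsed = " ".join(raw.split())
        if collapsed:
            cleaned.append(collapsed)
    return cleaned


# Sample input: the extracted strings shown above, plus one
# whitespace-only entry (an assumption, added for illustration).
extracted = [
    "\n            What did Ellen Degeneres say to Kathy Lee? \n           ",
    "Can I be Frank with you? ",
    "Submitted by Calamjo",
    " \n",
    "Edited by Curtis",
]

print(clean_texts(extracted))
```

In a spider, the list returned by .extract() could be passed through such a helper before yielding, so the inconsistent markup on the site only has to be handled in the XPath and not in the text cleanup as well.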

Upvotes: 2
