Extract all text in between two nodes using XPath for web scraping?

Question

       
            Can I be Frank
            What did Ellen Degeneres say to Kathy Lee? 
           
 Can I be Frank with you?  
           Submitted by Calamjo 
           Edited by Curtis      
       #joke #short

So I am trying to extract all text after the </h2> and before the <div align="right" style=...> node. What I have tried so far:

    jokes = response.xpath('//div[@class="jokeContent"]')
    for joke in jokes:
        text = joke.xpath('text()[normalize-space()]').extract()
        if len(text) > 0:
            yield text

This works to some extent, but the website's HTML is inconsistent: sometimes the text is wrapped in <p>TEXT</p>, sometimes in <br>TEXT</br>, and sometimes it is just bare TEXT. So I thought extracting everything after the header and before the style node might make sense, and the filtering can then be done afterwards.

Granitosaurus · Accepted Answer

If you are looking for a literal XPath translation of what you are describing, it could be something like:

In [1]: sel.xpath("//h2/following-sibling::*[not(self::div) and not(preceding-sibling::div)]//text()").extract()
Out[1]: [u'Can I be Frank with you? ', u'Submitted by Calamjo', u'Edited by Curtis']

But there's probably a more logical, cleaner solution:

In [2]: sel.xpath("//h2/following-sibling::p//text()").extract()
Out[2]: [u'Can I be Frank with you? ', u'Submitted by Calamjo', u'Edited by Curtis']

This just selects the paragraph tags. You said the text might be wrapped in something other than paragraph tags; you can match several different tags with the self::tag test:

In [3]: sel.xpath("//h2/following-sibling::*[self::p or self::br]//text()").extract()
Out[3]: [u'Can I be Frank with you? ', u'Submitted by Calamjo', u'Edited by Curtis']
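The expressions so far select element siblings only, so text that is not wrapped in any tag is skipped. As a rough sketch of the "everything between the header and the style node" idea from the question, the sibling axis can use node() instead of *, which also matches text nodes. The example below uses lxml rather than a Scrapy selector so it can run on its own; the three markup strings are assumptions about how the site might wrap a joke, not the real page, and `LAYOUTS`, `BETWEEN` and `text_between` are names made up for the illustration.

```python
from lxml import html

# Assumed markup: three guesses at how the site wraps a joke body.
LAYOUTS = {
    "p": '<div class="jokeContent"><h2>Title</h2><p>First line</p>'
         '<p>Second line</p><div align="right">#joke</div></div>',
    "br": '<div class="jokeContent"><h2>Title</h2>First line<br>'
          'Second line<br><div align="right">#joke</div></div>',
    "bare": '<div class="jokeContent"><h2>Title</h2>First line Second line'
            '<div align="right">#joke</div></div>',
}

# node() matches text nodes as well as elements, unlike *.
# The predicate drops the trailing <div> and anything that comes after it.
BETWEEN = (
    '//div[@class="jokeContent"]/h2/following-sibling::node()'
    '[not(self::div) and not(preceding-sibling::div)]'
    '/descendant-or-self::text()[normalize-space()]'
)


def text_between(markup):
    """Return stripped text found after the <h2> and before the first sibling <div>."""
    tree = html.fromstring(markup)
    return [t.strip() for t in tree.xpath(BETWEEN)]


for name, markup in LAYOUTS.items():
    print(name, text_between(markup))
```

A <br> element has no children, so the text around it is reachable only as sibling text nodes; that is why the br layout needs node() rather than self::br//text(). The same expression string should also work with sel.xpath(...) in a Scrapy shell, followed by .extract().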

Edit: apparently I missed the text directly under the div itself. This can be amended with the | (union, or "or") operator:

In [4]: sel.xpath("//h2/../text()[normalize-space(.)] | //h2/../p//text()").extract()
Out[4]: 
[u'\n            What did Ellen Degeneres say to Kathy Lee? \n           ',
 u'Can I be Frank with you? ',
 u'Submitted by Calamjo',
 u'Edited by Curtis']

normalize-space(.) is there only to filter out text nodes that contain nothing but whitespace (e.g. ' ').
You can append the first part of this XPath to any of the expressions above and you'd get similar results.
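To tie this back to the loop in the question, here is a minimal sketch that applies the union relative to each jokeContent div, so every joke keeps its own list of fragments. It uses lxml so that it runs standalone; the `PAGE` markup is reconstructed from the question and is only an assumption about the real page, and `JOKE_TEXT` and `extract_jokes` are hypothetical names.

```python
from lxml import html

# Assumed markup, reconstructed from the question; the real page may differ.
PAGE = """
<html><body>
<div class="jokeContent">
  <h2>Can I be Frank</h2>
  What did Ellen Degeneres say to Kathy Lee?
  <p>Can I be Frank with you?</p>
  <p>Submitted by Calamjo</p>
  <p>Edited by Curtis</p>
  <div align="right">#joke #short</div>
</div>
</body></html>
"""

# Relative form of the union: text directly under the div,
# plus text inside its <p> children. Whitespace-only nodes are skipped.
JOKE_TEXT = './text()[normalize-space()] | ./p//text()[normalize-space()]'


def extract_jokes(markup):
    """Return one list of stripped text fragments per jokeContent div."""
    tree = html.fromstring(markup)
    jokes = []
    for joke in tree.xpath('//div[@class="jokeContent"]'):
        parts = [t.strip() for t in joke.xpath(JOKE_TEXT)]
        if parts:  # skip divs that contain no usable text
            jokes.append(parts)
    return jokes


print(extract_jokes(PAGE))
```

Inside the Scrapy callback from the question, the same relative expression could be passed to joke.xpath(...) and followed by .extract(). The header is not included because its text sits inside the h2, and the tags are left out because they sit inside the nested div.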
