Reputation: 189
<div class="jokeContent">
<h2 style="color:#369;">Can I be Frank</h2>
What did Ellen Degeneres say to Kathy Lee?
<p></p> <p>Can I be Frank with you? </p>
<p>Submitted by Calamjo</p>
<p>Edited by Curtis</p>
<div align="right" style="margin-top:10px;margin-bottom:10px;">#joke <a href="http://www.jokesoftheday.net/tag/short-jokes/">#short</a> </div>
<div style="clear:both;"></div>
</div>
So I am trying to extract all text after the </h2> and before the <div align="right" style=...> node. What I have tried so far:
jokes = response.xpath('//div[@class="jokeContent"]')
for joke in jokes:
    # text nodes directly under the div, skipping whitespace-only ones
    text = joke.xpath('text()[normalize-space()]').extract()
    if len(text) > 0:
        yield text
This works to some extent, but the website's HTML is inconsistent: sometimes the text is wrapped in <p> TEXT </p>, sometimes in <br> TEXT </br>, and sometimes it is just bare TEXT. So I thought it might make sense to extract everything after the header and before the styled div, and do the filtering afterwards.
Upvotes: 0
Views: 969
Reputation: 21446
If you are looking for a literal XPath for what you are describing, it could be something like:
In [1]: sel.xpath("//h2/following-sibling::*[not(self::div) and not(preceding-sibling::div)]//text()").extract()
Out[1]: [u'Can I be Frank with you? ', u'Submitted by Calamjo', u'Edited by Curtis']
But there's probably a more logical, cleaner solution:
In [2]: sel.xpath("//h2/following-sibling::p//text()").extract()
Out[2]: [u'Can I be Frank with you? ', u'Submitted by Calamjo', u'Edited by Curtis']
This just selects the paragraph tags. You said the text might be wrapped in other tags, and you can match several different tags with the self::tag specification:
In [3]: sel.xpath("//h2/following-sibling::*[self::p or self::br]//text()").extract()
Out[3]: [u'Can I be Frank with you? ', u'Submitted by Calamjo', u'Edited by Curtis']
Edit: apparently I missed the text directly under the div itself. This can be amended with the | (union, i.e. "or") operator:
In [4]: sel.xpath("//h2/../text()[normalize-space(.)] | //h2/../p//text()").extract()
Out[4]:
[u'\n What did Ellen Degeneres say to Kathy Lee? \n ',
u'Can I be Frank with you? ',
u'Submitted by Calamjo',
u'Edited by Curtis']
normalize-space(.) is there only to drop text nodes that contain nothing but whitespace (e.g. ' \n').
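To see why that predicate drops whitespace-only nodes: normalize-space() trims and collapses whitespace, and an empty string counts as false inside an XPath predicate. A rough Python approximation, where the helper name normalize_space and the sample values are made up for illustration (note that str.split() treats a few more characters as whitespace than XPath does):

```python
def normalize_space(value):
    """Approximate XPath normalize-space(): trim and collapse whitespace runs."""
    return " ".join(value.split())


# Text nodes as they might come back without the predicate (illustrative values).
raw = ["\n What did Ellen Degeneres say to Kathy Lee? \n ", " \n", " ", "\n"]

# Keep only the nodes whose normalized value is non-empty, as the predicate does.
# The kept node itself is returned unchanged, surrounding whitespace included.
kept = [t for t in raw if normalize_space(t)]
print(kept)
```

This also shows why the first result above still carries its leading and trailing whitespace: the predicate only filters nodes, it does not rewrite them.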
You can combine the first part of this XPath (using |) with any of the expressions above and you'd get similar results.
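If the XPath gets unwieldy because of the inconsistent markup, the "everything after the h2 and before the right-aligned div" idea from the question can also be sketched without XPath, using the standard library's html.parser. This is a different technique from the selectors above and not a Scrapy API; the class name JokeTextParser and the variable SAMPLE are made up for illustration, and it assumes every joke block looks like the snippet in the question.

```python
from html.parser import HTMLParser

# Sample markup copied from the question.
SAMPLE = """<div class="jokeContent">
<h2 style="color:#369;">Can I be Frank</h2>
What did Ellen Degeneres say to Kathy Lee?
<p></p> <p>Can I be Frank with you? </p>
<p>Submitted by Calamjo</p>
<p>Edited by Curtis</p>
<div align="right" style="margin-top:10px;margin-bottom:10px;">#joke <a href="http://www.jokesoftheday.net/tag/short-jokes/">#short</a> </div>
<div style="clear:both;"></div>
</div>"""


class JokeTextParser(HTMLParser):
    """Collect text that appears after </h2> and before <div align="right">."""

    def __init__(self):
        super().__init__()
        self.collecting = False  # True only between </h2> and the right-aligned div
        self.texts = []

    def handle_starttag(self, tag, attrs):
        # Stop collecting at the right-aligned div that holds the tags.
        if tag == "div" and dict(attrs).get("align") == "right":
            self.collecting = False

    def handle_endtag(self, tag):
        # Start collecting once the header has been closed.
        if tag == "h2":
            self.collecting = True

    def handle_data(self, data):
        # Collapse whitespace and skip chunks that are empty afterwards.
        text = " ".join(data.split())
        if self.collecting and text:
            self.texts.append(text)


parser = JokeTextParser()
parser.feed(SAMPLE)
parser.close()
print(parser.texts)
```

Because it keys on the h2 and the div rather than on p or br, it picks up bare text as well as text wrapped in other tags; filtering out lines such as "Submitted by ..." can then be done on the resulting list.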
Upvotes: 2