Reputation: 2809
Follow-up to: HTML XPath: Extracting text mixed in with multiple tags?
I've made my test case more difficult:
<div id="mw-content-text"><h2><span class="mw-headline" >CIA</span></h2>
<ol>
<li><small>Military</small> Central <a href="/Intelligence_Agency.html">Intelligence Agency</a>.</li>
<li>Culinary <a href="/Institute.html">Institute</a> of <a href="/America.html">America</a>.<br/>Renowned cooking school.</li>
</ol>
</div>
I have the same goal, namely, extracting:
Central Intelligence Agency.
Culinary Institute of America.
Can I selectively choose which tags are excluded?
I've tried things like (for removing 'Military'):
id('mw-content-text')/ol/li[not(self::small)]
but that predicate tests the 'li' element itself, which is never a 'small' element, so nothing is filtered out.
And if I do something similar:
id('mw-content-text')/ol/li/*[not(self::small)]
then I'm only filtering the element children of 'li', so even though I successfully throw away 'Military', I've also thrown away 'Central' and 'Culinary', i.e. the text that belongs directly to the parent.
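To illustrate why the `*` step loses 'Central', here is a minimal sketch using Python's standard-library `xml.dom.minidom` (a stand-in for an XPath engine, chosen only because it needs no third-party install). It assumes the snippet above parses as well-formed XML. The element children of the first `li` are what `li/*` can see, while the text children are siblings of those elements and are only reachable through `text()` or `node()`.

```python
from xml.dom import minidom

# The test case from the question, assumed to be well-formed XML.
HTML = """<div id="mw-content-text"><h2><span class="mw-headline">CIA</span></h2>
<ol>
<li><small>Military</small> Central <a href="/Intelligence_Agency.html">Intelligence Agency</a>.</li>
<li>Culinary <a href="/Institute.html">Institute</a> of <a href="/America.html">America</a>.<br/>Renowned cooking school.</li>
</ol>
</div>"""

doc = minidom.parseString(HTML)
first_li = doc.getElementsByTagName("li")[0]

# Roughly what li/* matches: element children only.
element_children = [n.tagName for n in first_li.childNodes
                    if n.nodeType == n.ELEMENT_NODE]

# Roughly what li/text() matches: text nodes that are direct children of li.
text_children = [n.data for n in first_li.childNodes
                 if n.nodeType == n.TEXT_NODE]

print(element_children)  # ['small', 'a']
print(text_children)     # [' Central ', '.']
```

Filtering `li/*` can therefore only ever keep or drop `small` and `a`; the ' Central ' text node is never part of that node set in the first place.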
I had understood the tree to be something like:
div -- li
        -- small -- Military
        -- Central
        -- a -- Intelligence Agency
    -- li
        -- Culinary
        -- a -- Institute
        -- of
        -- a -- America
        -- br
        -- Renowned cooking school.
Is that correct? Is there a way to say 'text nodes of li and li's descendants EXCEPT descendants of small'? How about '... EXCEPT a br element and all following text nodes'?
Again, (partial) Pythonic solutions are also acceptable, though XPath is preferred.
After sitting down to read Chapter 6 'XPath and XPointer' of 'Learning XML, Second Edition' by Erik Ray, I think I've got a grasp on it. I came up with the following formulation:
id('mw-content-text')/ol/li//text()[not(parent::small) and not(preceding-sibling::br)]
In this case, it doesn't seem possible to concatenate the resulting node set of text nodes within XPath itself. When we simply feed the 'li' element to the string function, the resulting string-value is the concatenation of all text nodes that are descendants of 'li'. But here we need further filtering, so we end up with a node set (of qualifying text nodes) instead of a single element node, and in XPath 1.0 the string function applied to a node set returns the string-value of only its first node. Regarding concatenating node sets, a helpful SO question can be found here: XPath to return string concatenation of qualifying child node values
Any advice on how to improve this solution?
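For the concatenation step, a partial Python approach might look like the sketch below. It does not run the XPath expression; instead it walks the tree with the standard-library `xml.dom.minidom` and applies the same two tests (`not(parent::small)` and `not(preceding-sibling::br)`) by hand, then joins the surviving text nodes per `li`. The helper names `has_preceding_br` and `qualifying_text` are invented for this example, and the input is assumed to be well-formed XML.

```python
from xml.dom import minidom

# The test case from the question, assumed to be well-formed XML.
HTML = """<div id="mw-content-text"><h2><span class="mw-headline">CIA</span></h2>
<ol>
<li><small>Military</small> Central <a href="/Intelligence_Agency.html">Intelligence Agency</a>.</li>
<li>Culinary <a href="/Institute.html">Institute</a> of <a href="/America.html">America</a>.<br/>Renowned cooking school.</li>
</ol>
</div>"""

def has_preceding_br(node):
    """True if any earlier sibling of node is a br element."""
    sib = node.previousSibling
    while sib is not None:
        if sib.nodeType == sib.ELEMENT_NODE and sib.tagName == "br":
            return True
        sib = sib.previousSibling
    return False

def qualifying_text(elem):
    """Yield descendant text nodes of elem, in document order, whose parent
    is not small and that have no preceding br sibling."""
    for child in elem.childNodes:
        if child.nodeType == child.TEXT_NODE:
            if elem.tagName != "small" and not has_preceding_br(child):
                yield child.data
        elif child.nodeType == child.ELEMENT_NODE:
            yield from qualifying_text(child)

doc = minidom.parseString(HTML)
ol = doc.getElementsByTagName("ol")[0]

# Concatenate per li, then collapse whitespace (similar in spirit to normalize-space()).
items = [" ".join("".join(qualifying_text(li)).split())
         for li in ol.getElementsByTagName("li")]
print(items)  # ['Central Intelligence Agency.', 'Culinary Institute of America.']
```

Like the XPath expression, this only checks sibling text nodes against `br`; text nested inside an element that follows a `br` would still be kept.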
Upvotes: 2
Views: 926
Reputation: 243459
Use:
/*/ol/li/descendant-or-self::*
    [text() and not(self::small)]
        /text()[not(preceding-sibling::br)]
Upvotes: 2