ChaimKut

Reputation: 2809

HTML XPath: Selectively avoiding tags when extracting text

Follow-up to: HTML XPath: Extracting text mixed in with multiple tags?

I've made my test case more difficult:

<div id="mw-content-text"><h2><span class="mw-headline" >CIA</span></h2>
<ol>
<li><small>Military</small> Central <a href="/Intelligence_Agency.html">Intelligence Agency</a>.</li>
<li>Culinary <a href="/Institute.html">Institute</a> of <a href="/America.html">America</a>.<br/>Renowned cooking school.</li>
</ol>

</div>  

I have the same goal, namely, extracting:

Central Intelligence Agency.
Culinary Institute of America.

Can I selectively choose which tags are excluded?

I've tried things like (for removing 'Military'):

id('mw-content-text')/ol/li[not(self::small)]

but that predicate tests the 'li' element itself, which is never a 'small' element, so nothing gets filtered out.

And if I do something similar

id('mw-content-text')/ol/li/*[not(self::small)]

then I'm only selecting the element children of 'li', and even though I successfully throw away 'Military', I've also thrown away 'Central' and 'Culinary', i.e. the text that belongs directly to the parent.

I had understood the tree to be something like:

div -- li  
          -- small -- Military  
          -- Central  
          -- a     -- Intelligence Agency  
    -- li  
          -- Culinary  
          -- a     -- Institute  
          -- of  
          -- a    -- America  
          -- br  
          -- Renowned cooking school.  

Is that correct? Is there a way to say 'text nodes of li and li's descendants EXCEPT descendants of small'? How about '... EXCEPT a br element and all following text nodes'?
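One way to check the assumed tree is to dump it. The following is a minimal sketch using Python's standard-library xml.dom.minidom, which keeps text as separate nodes much like the XPath data model does. It assumes the snippet parses as well-formed XML (it does here, since the br is self-closed); the helper name outline is made up for illustration, and whitespace-only text nodes are skipped to keep the output short.

```python
from xml.dom import minidom

HTML = """<div id="mw-content-text"><h2><span class="mw-headline" >CIA</span></h2>
<ol>
<li><small>Military</small> Central <a href="/Intelligence_Agency.html">Intelligence Agency</a>.</li>
<li>Culinary <a href="/Institute.html">Institute</a> of <a href="/America.html">America</a>.<br/>Renowned cooking school.</li>
</ol>

</div>"""


def outline(node, depth=0):
    """Return indented lines naming each element and each non-blank text node."""
    lines = []
    if node.nodeType == node.ELEMENT_NODE:
        lines.append("  " * depth + node.tagName)
        for child in node.childNodes:
            lines.extend(outline(child, depth + 1))
    elif node.nodeType == node.TEXT_NODE and node.data.strip():
        # repr() makes leading/trailing spaces inside text nodes visible
        lines.append("  " * depth + repr(node.data))
    return lines


doc = minidom.parseString(HTML)
tree_lines = outline(doc.documentElement)
print("\n".join(tree_lines))
```

The dump shows the 'li' elements sitting under 'ol' rather than directly under 'div', and the '.' after each final link appearing as its own text node, a sibling of the 'a' element.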

Again, (partial) Pythonic solutions are also acceptable, though XPath is preferred.


After sitting down to read Chapter 6 'XPath and XPointer' of 'Learning XML, Second Edition' by Erik Ray, I think I've got a grasp on it. I came up with the following formulation:

id('mw-content-text')/ol/li//text()[not(parent::small) and not(preceding-sibling::br)]

With this expression, it doesn't seem possible to concatenate the resulting node set of text nodes within XPath itself. If we simply feed the 'li' element to the string function, the resulting string-value is the concatenation of all of li's descendant text nodes, unfiltered. Here we need the extra filtering, so we end up with a node set (of qualifying text nodes) instead of a single element node, and in XPath 1.0 string() applied to a node set returns only the string-value of its first node. Regarding concatenating node sets, a helpful SO question can be found here: XPath to return string concatenation of qualifying child node values
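Since the concatenation has to happen outside the XPath expression anyway, here is a rough standard-library sketch of doing both the filtering and the joining in Python. It does not evaluate the XPath itself; it re-implements the same two conditions (parent is not 'small', no preceding-sibling 'br') by walking the tree with xml.dom.minidom. The helper names text_nodes, has_preceding_br and keep are invented for this example, and the input is assumed to be well-formed XML.

```python
from xml.dom import minidom

HTML = """<div id="mw-content-text"><h2><span class="mw-headline" >CIA</span></h2>
<ol>
<li><small>Military</small> Central <a href="/Intelligence_Agency.html">Intelligence Agency</a>.</li>
<li>Culinary <a href="/Institute.html">Institute</a> of <a href="/America.html">America</a>.<br/>Renowned cooking school.</li>
</ol>

</div>"""


def text_nodes(node):
    """Yield descendant text nodes in document order (analogous to //text())."""
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            yield child
        elif child.nodeType == child.ELEMENT_NODE:
            yield from text_nodes(child)


def has_preceding_br(node):
    """True if any earlier sibling is a br element (preceding-sibling::br)."""
    sib = node.previousSibling
    while sib is not None:
        if sib.nodeType == sib.ELEMENT_NODE and sib.tagName == "br":
            return True
        sib = sib.previousSibling
    return False


def keep(text):
    """Mirror the predicate not(parent::small) and not(preceding-sibling::br)."""
    return text.parentNode.tagName != "small" and not has_preceding_br(text)


doc = minidom.parseString(HTML)
results = []
for li in doc.getElementsByTagName("li"):
    # Join per li so each list item yields one string
    parts = [t.data for t in text_nodes(li) if keep(t)]
    results.append("".join(parts).strip())

print(results)
```

Looping over each 'li' separately keeps the text of the two list items apart, which a single flat node set of text nodes would not do on its own.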

Any advice on how to improve this solution?

Upvotes: 2

Views: 926

Answers (1)

Dimitre Novatchev

Reputation: 243459

Use:

 /*/ol/li/descendant-or-self::*
          [text() and not(self::small)]
              /text()[not(preceding-sibling::br)]

Upvotes: 2
