Carl

Reputation: 2934

Flexible text retrieval with XPath

I have a number of HTML text fragments, each containing the phrase "Toy:" exactly once.

E.g.,

<p><b>Toy: </b><b>Train</b></p>
<p><b>Toy:</b><b>Chess game</b></p>
<p><b>Toy: </b><span>Guitar</span></p>
<p><b>Toy: </b>Doll</p>
<p><strong><ul>Toy: </ul></strong></b><b>Monkey costume</b></p>
<p><b>Toy: Train</b></p>
<p>Toy: Skipping rope</p>
<p>Toy:Snail</p>

I'd like to pull out the text from these.

e.g.,

Toy: Train
Toy:Chess game
Toy: Guitar
Toy: Doll
Toy: Monkey costume
Toy: Train
Toy: Skipping rope
Toy:Snail

I'm having trouble coming up with a single XPath expression that does this, though I feel it should be possible.

Example:

//p[starts-with(descendant-or-self::*/text(), "%s")]

Upvotes: 1

Views: 46

Answers (1)

kjhughes

Reputation: 111726

First, XPath requires well-formed XML:

<root>
  <p><b>Toy: </b><b>Train</b></p>
  <p><b>Toy:</b><b>Chess game</b></p>
  <p><b>Toy: </b><span>Guitar</span></p>
  <p><b>Toy: </b>Doll</p>
  <p><strong><ul>Toy: </ul></strong><b>Monkey costume</b></p>
  <p><b>Toy: Train</b></p>
  <p>Toy: Skipping rope</p>
  <p>Toy:Snail</p>
</root>
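As a quick illustration of why the input needs repairing, here is a minimal sketch in Python using only the standard library. It assumes a strict XML parser (xml.etree.ElementTree) as the checker; the helper name is_well_formed is hypothetical. The fifth line of the question has a stray </b>, which the version above drops.

```python
import xml.etree.ElementTree as ET

# Fifth line as posted in the question (note the stray </b>).
original = '<p><strong><ul>Toy: </ul></strong></b><b>Monkey costume</b></p>'
# Same line with the stray </b> removed, as in the XML above.
repaired = '<p><strong><ul>Toy: </ul></strong><b>Monkey costume</b></p>'


def is_well_formed(fragment):
    """Return True if a strict XML parser accepts the fragment."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


print(is_well_formed(original))
print(is_well_formed(repaired))
```

If your real input is tag-soup HTML rather than XML, you would need to repair it or load it with an HTML-aware parser before evaluating XPath against it.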

Then, you can select all of the p elements whose string value starts with "Toy:":

//p[starts-with(., 'Toy:')]

"I'd like to pull out the text from these."

In pure XPath 1.0, you can do

//p[starts-with(., 'Toy:')]//text()

to retrieve the text nodes under each p element whose string value starts with "Toy:". However, each text node is returned as a separate item, so the pieces are not grouped per enclosing p.

To keep the text grouped under each enclosing p, you can iterate over the selected p elements and take the string value of each one in whatever language hosts your XPath evaluation, or you could use XPath 2.0:

//p[starts-with(., 'Toy:')]/string()

will return

Toy: Train
Toy:Chess game
Toy: Guitar
Toy: Doll
Toy: Monkey costume
Toy: Train
Toy: Skipping rope
Toy:Snail
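If XPath 2.0 isn't available, the host-language approach can be sketched roughly as follows. This is only an illustration under some assumptions: it uses Python's standard-library xml.etree.ElementTree, whose limited XPath subset has no starts-with() function, so the prefix test is done in Python instead. The function name toy_strings is hypothetical.

```python
import xml.etree.ElementTree as ET

XML = """<root>
  <p><b>Toy: </b><b>Train</b></p>
  <p><b>Toy:</b><b>Chess game</b></p>
  <p><b>Toy: </b><span>Guitar</span></p>
  <p><b>Toy: </b>Doll</p>
  <p><strong><ul>Toy: </ul></strong><b>Monkey costume</b></p>
  <p><b>Toy: Train</b></p>
  <p>Toy: Skipping rope</p>
  <p>Toy:Snail</p>
</root>"""


def toy_strings(xml_text, prefix="Toy:"):
    """Return the string value of every p element that starts with prefix."""
    root = ET.fromstring(xml_text)
    results = []
    # './/p' is within ElementTree's supported XPath subset.
    for p in root.findall(".//p"):
        # Concatenating all descendant text mirrors XPath's string value.
        value = "".join(p.itertext())
        # ElementTree has no starts-with(), so filter in Python.
        if value.startswith(prefix):
            results.append(value)
    return results


for line in toy_strings(XML):
    print(line)
```

With a library that supports full XPath 1.0, you could instead select with //p[starts-with(., 'Toy:')] and then take the string value of each result in the loop.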

Upvotes: 2
