Reputation: 2934
I've a bunch of HTML text streams, each containing the phrase "Toy:" once.
E.g.,
<p><b>Toy: </b><b>Train</b></p>
<p><b>Toy:</b><b>Chess game</b></p>
<p><b>Toy: </b><span>Guitar</span></p>
<p><b>Toy: </b>Doll</p>
<p><strong><ul>Toy: </ul></strong></b><b>Monkey costume</b></p>
<p><b>Toy: Train</b></p>
<p>Toy: Skipping rope</p>
<p>Toy:Snail</p>
I'd like to pull out the text from these.
e.g.,
Toy: Train
Toy:Chess game
Toy: Guitar
Toy: Doll
Toy: Monkey costume
Toy: Train
Toy: Skipping rope
Toy:Snail
I'm having trouble coming up with a single XPath expression for this, which I feel should be possible.
My attempt so far (%s stands for the search phrase):
//p[starts-with(descendant-or-self::*/text(), "%s")]
Upvotes: 1
Views: 46
Reputation: 111726
First, XPath requires well-formed XML:
<root>
<p><b>Toy: </b><b>Train</b></p>
<p><b>Toy:</b><b>Chess game</b></p>
<p><b>Toy: </b><span>Guitar</span></p>
<p><b>Toy: </b>Doll</p>
<p><strong><ul>Toy: </ul></strong><b>Monkey costume</b></p>
<p><b>Toy: Train</b></p>
<p>Toy: Skipping rope</p>
<p>Toy:Snail</p>
</root>
Then, you can select all of the p elements that start with "Toy:":
//p[starts-with(., 'Toy:')]
"I'd like to pull out the text from these."
In pure XPath 1.0, you can do
//p[starts-with(., 'Toy:')]//text()
to retrieve the text nodes under each p element starting with "Toy:", but each text node string will be on its own line rather than grouped per enclosing p.
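To make the ungrouped result concrete, here is a small Python sketch. It is an illustration under assumptions, not the XPath itself: the standard library's xml.etree.ElementTree implements only a subset of XPath and has no starts-with(), so the predicate is emulated in Python. The three-element sample and the name text_nodes are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: a well-formed subset of the markup above.
XML = """<root>
<p><b>Toy: </b><b>Train</b></p>
<p><b>Toy: </b>Doll</p>
<p>Toy:Snail</p>
</root>"""


def text_nodes(xml_source, prefix="Toy:"):
    """Emulate //p[starts-with(., prefix)]//text() as one flat list."""
    root = ET.fromstring(xml_source)
    nodes = []
    for p in root.iter("p"):
        # The string value of p is the concatenation of its descendant text.
        if "".join(p.itertext()).startswith(prefix):
            # Each descendant text node is kept as a separate item.
            nodes.extend(p.itertext())
    return nodes


print(text_nodes(XML))
# ['Toy: ', 'Train', 'Toy: ', 'Doll', 'Toy:Snail']
```

The label and the toy name arrive as separate items, with nothing to show which p they came from.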
To keep the text grouped under each enclosing p, you can step through the selected p elements and take the string value of each one in whatever host language you're using to evaluate the XPath, or you could use XPath 2.0:
//p[starts-with(., 'Toy:')]/string()
will return
Toy: Train
Toy:Chess game
Toy: Guitar
Toy: Doll
Toy: Monkey costume
Toy: Train
Toy: Skipping rope
Toy:Snail
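The host-language route can be sketched in Python as well. This is a minimal sketch assuming the well-formed document shown earlier. It uses only the standard library, whose ElementTree module lacks starts-with(), so the filter is applied in Python after computing each string value. The name toy_strings is hypothetical.

```python
import xml.etree.ElementTree as ET

XML = """<root>
<p><b>Toy: </b><b>Train</b></p>
<p><b>Toy:</b><b>Chess game</b></p>
<p><b>Toy: </b><span>Guitar</span></p>
<p><b>Toy: </b>Doll</p>
<p><strong><ul>Toy: </ul></strong><b>Monkey costume</b></p>
<p><b>Toy: Train</b></p>
<p>Toy: Skipping rope</p>
<p>Toy:Snail</p>
</root>"""


def toy_strings(xml_source, prefix="Toy:"):
    """Return the string value of every p element that starts with prefix."""
    root = ET.fromstring(xml_source)
    # Joining itertext() gives the same result as XPath's string(p):
    # all descendant text, concatenated in document order.
    values = ("".join(p.itertext()) for p in root.iter("p"))
    return [value for value in values if value.startswith(prefix)]


for line in toy_strings(XML):
    print(line)
```

Each p contributes exactly one string, so the label and the toy name stay together regardless of how the markup nests them.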
Upvotes: 2