Reputation: 211
I am working on extracting text from HTML documents and storing it in a database. I am using the webharvest tool to extract the content, but I am stuck at one point. Inside webharvest I use XQuery expressions in order to extract the data. The HTML document that I am parsing is as follows:
<td><a name="hw">HELLOWORLD</a>Hello world</td>
I need to extract the text "Hello world" from the above HTML snippet.
I have tried extracting the text in this fashion:
$hw :=data($item//a[@name='hw']/text())
However what I always get is "HELLOWORLD" instead of "Hello world".
Is there a way to extract "Hello world"? Please help.
What if I want to do it this way:
<td>
<a name="hw1">HELLOWORLD1</a>Hello world1
<a name="hw2">HELLOWORLD2</a>Hello world2
<a name="hw3">HELLOWORLD3</a>Hello world3
</td>
I would like to extract the text "Hello world2" that is between hw2 and hw3. I would rather not use text()[3]; is there some way I could extract the text between /a[@name='hw2'] and /a[@name='hw3']?
Upvotes: 3
Views: 13274
Reputation: 243529
I would not want to use text()[3] but is there some way I could extract the text out between /a[@name='hw2'] and /a[@name='hw3'].
If there is just one text node between the two <a> elements, then the following would be quite simple:
/a[@name='hw3']/preceding::text()[1]
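As a rough check of the single-text-node case, the sketch below runs that expression against the sample from the question. It assumes $item is bound to the <td> element (so the path starts at $item rather than at the document root) and wraps the result in normalize-space() only to trim the trailing line break. Both choices are assumptions made for illustration.

```xquery
(: Sample data copied from the question; $item is assumed to be the <td>. :)
let $item :=
  <td>
    <a name="hw1">HELLOWORLD1</a>Hello world1
    <a name="hw2">HELLOWORLD2</a>Hello world2
    <a name="hw3">HELLOWORLD3</a>Hello world3
  </td>
(: preceding:: is a reverse axis, so [1] is the text node closest before hw3. :)
return normalize-space($item/a[@name='hw3']/preceding::text()[1])
```

This should return Hello world2. If nothing separated the two anchors, the same expression would pick up the text inside hw2 (HELLOWORLD2) instead, which is why it only suits the case where exactly one text node sits between them.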
If there is more than one text node between the two elements, then you need to express the intersection of all text nodes following the first element with all text nodes preceding the second element. The formula for the intersection of two node-sets (the Kaysian method of intersection) is:
$ns1[count(.|$ns2) = count($ns2)]
So, in the above expression, just replace $ns1 with:
/a[@name='hw2']/following-sibling::text()
and $ns2 with:
/a[@name='hw3']/preceding-sibling::text()
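Put together, the substitution might look like the sketch below. It keeps $ns1 and $ns2 as variables for readability and, as an assumption, evaluates both paths relative to an $item bound to the question's <td> rather than from the document root.

```xquery
(: Sample data copied from the question; $item is assumed to be the <td>. :)
let $item :=
  <td>
    <a name="hw1">HELLOWORLD1</a>Hello world1
    <a name="hw2">HELLOWORLD2</a>Hello world2
    <a name="hw3">HELLOWORLD3</a>Hello world3
  </td>
let $ns1 := $item/a[@name='hw2']/following-sibling::text()
let $ns2 := $item/a[@name='hw3']/preceding-sibling::text()
(: A node from $ns1 is kept only if adding it to $ns2 does not make
   $ns2 any larger, i.e. the node is already a member of $ns2. :)
return $ns1[count(. | $ns2) = count($ns2)]
```

Here $ns1 holds the text nodes after hw2 and $ns2 those before hw3, so the only node common to both is the one containing Hello world2 (with its trailing whitespace). In plain XPath 1.0, where let is not available, the two paths would be written inline in place of the variables.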
Lastly, if you really have XQuery (or XPath 2.0), then this is simply:
/a[@name='hw2']/following-sibling::text()
intersect
/a[@name='hw3']/preceding-sibling::text()
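To illustrate the case with more than one text node between the anchors, here is a small sketch using a hypothetical input in which a <br/> splits the text in two. The markup, the $item and $between variable names, and the use of string-join() to display the result are all assumptions for illustration.

```xquery
(: Hypothetical input: the <br/> splits the text between hw2 and hw3
   into two separate text nodes. :)
let $item :=
  <td><a name="hw2">HELLOWORLD2</a>Hello<br/>world2<a name="hw3">HELLOWORLD3</a>tail</td>
let $between :=
  $item/a[@name='hw2']/following-sibling::text()
  intersect
  $item/a[@name='hw3']/preceding-sibling::text()
(: intersect returns the common nodes in document order. :)
return string-join($between, ' ')
```

The text node tail follows hw3, so it is in the first operand only and drops out; the two remaining nodes are joined into Hello world2.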
Upvotes: 3
Reputation: 8422
This handles your expanded case, while letting you select by attribute value rather than position:
let $item :=
<td>
<a name="hw1">HELLOWORLD1</a>Hello world1
<a name="hw2">HELLOWORLD2</a>Hello world2
<a name="hw3">HELLOWORLD3</a>Hello world3
</td>
return $item//node()[./preceding-sibling::a/@name = "hw2"][1]
This gets the first node that has a preceding-sibling "a" element with a name attribute of "hw2".
Upvotes: 0
Reputation: 499132
Your XPath is selecting the text of the a nodes, not the text of the td nodes:
$item//a[@name='hw']/text()
Change it to this:
$item[a/@name='hw']/text()
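A side-by-side sketch of the two expressions, assuming $item is bound to the <td> from the question, may make the difference clearer:

```xquery
(: Sample data copied from the question; $item is assumed to be the <td>. :)
let $item := <td><a name="hw">HELLOWORLD</a>Hello world</td>
return (
  (: Text node that is a child of <a>: the anchor's own label. :)
  data($item//a[@name='hw']/text()),
  (: Text node that is a child of <td>, selected only if the <td>
     has an <a> child whose name attribute is 'hw'. :)
  data($item[a/@name='hw']/text())
)
```

The first item in the result is HELLOWORLD and the second is Hello world, because the wanted text is a sibling of the <a> element, not its content.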
Update (following comments and update to question):
This XPath selects the second text node from an $item that has an a tag with a name attribute set to hw:
$item[a/@name='hw']//text()[2]
Upvotes: 6