Reputation: 211

XQuery to extract text in HTML

I am extracting text from HTML documents and storing it in a database, using the webharvest tool to extract the content. However, I am stuck at one point. Inside webharvest I use an XQuery expression to extract the data. The HTML I am parsing looks like this:

              <td><a name="hw">HELLOWORLD</a>Hello world</td>

I need to extract the text "Hello world" from the above HTML.

I have tried extracting the text in this fashion:

     $hw := data($item//a[@name='hw']/text())

However, I always get "HELLOWORLD" instead of "Hello world".

Is there a way to extract "Hello world"? Please help.

What if I want to do it this way:

<td>
 <a name="hw1">HELLOWORLD1</a>Hello world1
 <a name="hw2">HELLOWORLD2</a>Hello world2
 <a name="hw3">HELLOWORLD3</a>Hello world3
</td>

I would like to extract the text "Hello world2" that sits between hw2 and hw3. I would rather not use text()[3]; is there some way to extract the text between /a[@name='hw2'] and /a[@name='hw3']?

Upvotes: 3

Answers (3)

Dimitre Novatchev

Reputation: 243529

I would rather not use text()[3]; is there some way to extract the text between /a[@name='hw2'] and /a[@name='hw3']?

If there is just one text node between the two <a> elements, the following simple expression is enough:

/a[@name='hw3']/preceding::text()[1]

If there is more than one text node between the two elements, you need the intersection of all text nodes following the first element with all text nodes preceding the second element. The formula for the intersection of two node-sets (the Kaysian method) is:

$ns1[count(.|$ns2) = count($ns2)]

So, in the above expression, just replace $ns1 with:

/a[@name='hw2']/following-sibling::text()

and $ns2 with:

/a[@name='hw3']/preceding-sibling::text()

Lastly, if you really have XQuery (or XPath 2), then this is simply:

   /a[@name='hw2']/following-sibling::text()
     intersect
   /a[@name='hw3']/preceding-sibling::text()
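
To make the two intersection approaches concrete, here is a minimal sketch that runs both against the sample from the question. The variable names $td, $kaysian and $direct are made up for this illustration, and the paths are anchored at $td rather than written as bare /a, which is an assumption about how the fragment is bound in your query.

```xquery
(: Sample data copied from the question; $td is a name chosen for this sketch. :)
let $td :=
  <td>
    <a name="hw1">HELLOWORLD1</a>Hello world1
    <a name="hw2">HELLOWORLD2</a>Hello world2
    <a name="hw3">HELLOWORLD3</a>Hello world3
  </td>

(: Text nodes after hw2, and text nodes before hw3. :)
let $ns1 := $td/a[@name='hw2']/following-sibling::text()
let $ns2 := $td/a[@name='hw3']/preceding-sibling::text()

(: XPath 1.0 style: Kaysian intersection. :)
let $kaysian := $ns1[count(. | $ns2) = count($ns2)]

(: XPath 2.0 / XQuery: the intersect operator. :)
let $direct := $ns1 intersect $ns2

(: Trim the surrounding whitespace of each selected text node. :)
return
  for $t in ($kaysian, $direct)
  return normalize-space($t)
```

Both variables should select the same single text node, so the query is expected to return the string Hello world2 twice. The normalize-space() call is only there because the text node in this sample carries the trailing line break and indentation.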

Upvotes: 3

Dave Cassel

Reputation: 8422

This handles your expanded case, while letting you select by attribute value rather than position:

let $item := 
  <td>
    <a name="hw1">HELLOWORLD1</a>Hello world1
    <a name="hw2">HELLOWORLD2</a>Hello world2
    <a name="hw3">HELLOWORLD3</a>Hello world3
  </td>

return $item//node()[./preceding-sibling::a/@name = "hw2"][1]

This gets the first node that has a preceding-sibling "a" element with a name attribute of "hw2".

Upvotes: 0

Oded

Reputation: 499132

Your XPath selects the text of the a nodes, not the text of the td nodes:

$item//a[@name='hw']/text()

Change it to this:

$item[a/@name='hw']/text()
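
To see the difference between the two expressions side by side, here is a minimal sketch. It assumes $item is bound to the td element itself, which the question does not state explicitly; the inline sample is copied from the question.

```xquery
(: Assumption: $item is the td element from the question. :)
let $item := <td><a name="hw">HELLOWORLD</a>Hello world</td>
return (
  (: text inside the a element :)
  data($item//a[@name='hw']/text()),
  (: text that is a direct child of the td :)
  data($item[a/@name='hw']/text())
)
```

With that binding, the first expression should give HELLOWORLD and the second Hello world. If $item is actually an ancestor of the td (a row, say), the predicate would need to be applied to the td step instead.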

Update (following comments and update to question):

This XPath selects the second text node within $item, provided $item has a child a element whose name attribute is set to hw:

$item[a/@name='hw']//text()[2]

Upvotes: 6
