KumZ

Reputation: 595

How to extract text with html link?

I am trying to parse an HTML page using BaseX. Given this fragment of the page:

 <td colspan="2" rowspan="1" class="light comment2 last2">
  <img class="textalign10" src="templates/comment10.png" 
       alt="*" width="10" height="10" border="0"/>
  <a shape="rect" href="mypage.php?userid=26682">user</a>
  : the text I'd like to keep [<a shape="rect" 
  href="http://alink" rel="nofollow">Link</a>] . with that part too.
 </td>

I need to extract the message, including the HTML link, and remove the leading ": " characters at the beginning.

I would like to obtain this exact text:

<message>
the text I'd like to keep [<a shape="rect" href="http://alink" rel="nofollow">Link</a>] . with that part too.
</message>

Using this function,

declare
 function gkm:node_message_from_comment($comment as item()*) {
  if ($comment) then
    copy $c := $comment
    modify (
      delete node $c/img[1],
      delete node $c/a[1],
      delete node $c/@*,
      rename node $c as 'message'
    )
    return $c
  else ()
};

I can extract the text, but I failed to remove the ": " from the beginning, i.e. I get:

<message>
: the text I'd like to keep [<a shape="rect" href="http://alink" rel="nofollow">Link</a>] . with that part too.
</message>

Upvotes: 2

Views: 114

Answers (1)

Jens Erat

Reputation: 38702

Using XQuery Update and transform expressions seems somewhat overcomplicated to me. You can instead select the nodes following the mypage.php link; with more knowledge of the input, there might also be better ways to select the required nodes.

To cut off the leading ": " substring, use substring-after. The pattern "cut off ': ' from the first result node, and return all others as is" also applies when using transform expressions, if you insist on using them.

let $comment := <td colspan="2" rowspan="1" class="light comment2 last2">
  <img class="textalign10" src="templates/comment10.png" alt="*" width="10" height="10" border="0"/>
  <a shape="rect" href="mypage.php?userid=26682">user</a>
  : the text I'd like to keep [<a shape="rect" href="http://alink" rel="nofollow">Link</a>] . with that part too.
 </td>
let $result := $comment/a[starts-with(@href, 'mypage.php')]/following-sibling::node()
return <message>{
  $result[1]/substring-after(., ': '),
  $result[position() > 1]
}</message>

As BaseX supports XQuery 3.0, you could also take advantage of the helper functions head and tail:

return <message>{
  head($result)/substring-after(., ': '),
  tail($result)
}</message>
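
If you do want to keep the transform expression from the question, the same "trim the first text node" pattern could be sketched roughly as follows. This is only a sketch: the namespace URI bound to gkm is a placeholder (the question does not show its declaration), the narrower element()? parameter type is my choice, and it assumes the text to trim is the first text node following the first a element.

```xquery
(: Hypothetical namespace URI; the question does not show how gkm is declared. :)
declare namespace gkm = "urn:example:gkm";

declare function gkm:node_message_from_comment($comment as element()?) as element()? {
  if ($comment) then
    copy $c := $comment
    modify (
      (: Keep only what follows ': ' in the text node after the user link.
         "for" makes this a no-op if no such text node exists. :)
      for $t in $c/a[1]/following-sibling::text()[1]
      return replace value of node $t with substring-after($t, ': '),
      delete node $c/img[1],
      delete node $c/a[1],
      delete node $c/@*,
      rename node $c as 'message'
    )
    return $c
  else ()
};
```

Note that substring-after returns an empty string when the separator is missing, so a comment whose text does not contain ": " would lose that first text node's content. All updates in the modify clause are evaluated against the original copy and applied together, so deleting a[1] does not interfere with locating its following sibling.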

Upvotes: 3
