Reputation: 35249

finding consecutive siblings with XPath

Here's an easy point for an XPath expert! :)

Document structure:

<tokens>
  <token>
    <word>Newt</word><entityType>PROPER_NOUN</entityType>
  </token>
  <token>
    <word>Gingrich</word><entityType>PROPER_NOUN</entityType>
  </token>
  <token>
    <word>admires</word><entityType>VERB</entityType>
  </token>
  <token>
    <word>Garry</word><entityType>PROPER_NOUN</entityType>
  </token>
  <token>
    <word>Trudeau</word><entityType>PROPER_NOUN</entityType>
  </token>
</tokens>

Ignoring the semantic improbability of the document, I want to pull out [["Newt", "Gingrich"], ["Garry", "Trudeau"]], that is: when there are two tokens in a row whose entityTypes are PROPER_NOUN, I want to extract the words from those two tokens.

I've gotten as far as:

"//token[entityType='PROPER_NOUN']/following-sibling::token[1][entityType='PROPER_NOUN']"

... which gets as far as finding the second of two consecutive PROPER_NOUN tokens, but I'm not sure how to get it to emit the first token along with it.

Some notes:

I don't mind doing higher-level processing of the NodeSets (e.g. in Ruby / Nokogiri) if that simplifies the problem.
In the event that there are three or more consecutive PROPER_NOUN tokens (call them A, B, C), ideally I'd like to emit [A, B], [B, C].
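
To make the desired output concrete, here is a minimal sketch in plain Ruby with no XPath involved. The `tokens` array of `[word, entity_type]` pairs is a hypothetical stand-in for the parsed document, not something taken from Nokogiri. `each_cons(2)` visits every adjacent pair, so a run A, B, C yields [A, B] and [B, C].

```ruby
# Hypothetical stand-in for the parsed <token> elements: [word, entity_type].
tokens = [
  ["Newt", "PROPER_NOUN"],
  ["Gingrich", "PROPER_NOUN"],
  ["Rep", "PROPER_NOUN"],
  ["admires", "VERB"],
  ["Garry", "PROPER_NOUN"],
  ["Trudeau", "PROPER_NOUN"]
]

# Walk every adjacent pair and keep those where both tokens are proper nouns.
pairs = tokens.each_cons(2)
              .select { |a, b| a[1] == "PROPER_NOUN" && b[1] == "PROPER_NOUN" }
              .map { |a, b| [a[0], b[0]] }

p pairs
```

The overlapping pair ["Gingrich", "Rep"] appears because the window slides by one token at a time.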

update

Here's my solution using higher-level Ruby functions. But I'm tired of all those XPath bullies kicking sand in my face, and I'd like to know the way REAL XPath coders do it!

def extract(doc)
  names = []
  sentences = doc.xpath("//tokens")
  sentences.each do |sentence| 
    tokens = sentence.xpath("token")
    prev = nil
    tokens.each do |token|
      name = token.xpath("word").text if token.xpath("entityType").text == "PROPER_NOUN"
      names << [prev, name] if (name && prev)
      prev = name
    end
  end
  names
end

Upvotes: 4

Answers (4)

Michael Kay

Reputation: 163675

XPath alone isn't powerful enough for this task. But it's very easy in XSLT 2.0:

<xsl:for-each-group select="token" group-adjacent="entityType">
  <xsl:if test="current-grouping-key() = 'PROPER_NOUN'">
     <xsl:copy-of select="current-group()"/>
     <xsl:text>====</xsl:text>
  </xsl:if>
</xsl:for-each-group>

Upvotes: 0

Dimitre Novatchev

Reputation: 243619

This XPath 1.0 expression:

   /*/token
      [entityType='PROPER_NOUN'
     and
       following-sibling::token[1]/entityType = 'PROPER_NOUN'
      ]
       /word

selects all "first-in-pair noun-words"

This XPath expression:

/*/token
  [entityType='PROPER_NOUN'
 and
   preceding-sibling::token[1]/entityType = 'PROPER_NOUN'
  ]
   /word

selects all "second-in-pair noun-words".

You'll have to produce the actual pairs by taking the kth node from each of the two resulting node-sets.

XSLT-based verification:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>

 <xsl:template match="/">
  <xsl:copy-of select=
  "/*/token
      [entityType='PROPER_NOUN'
     and
       following-sibling::token[1]/entityType = 'PROPER_NOUN'
      ]
       /word
  "/>
==============
  <xsl:copy-of select=
   "/*/token
      [entityType='PROPER_NOUN'
     and
       preceding-sibling::token[1]/entityType = 'PROPER_NOUN'
      ]
       /word
  "/>
 </xsl:template>
</xsl:stylesheet>

simply evaluates the two XPath expressions and outputs the results of these two evaluations (using a suitable delimiter to visualize the end of the first result and the start of the second result).

When applied on the provided XML document:

<tokens>
  <token>
    <word>Newt</word><entityType>PROPER_NOUN</entityType>
  </token>
  <token>
    <word>Gingrich</word><entityType>PROPER_NOUN</entityType>
  </token>
  <token>
    <word>admires</word><entityType>VERB</entityType>
  </token>
  <token>
    <word>Garry</word><entityType>PROPER_NOUN</entityType>
  </token>
  <token>
    <word>Trudeau</word><entityType>PROPER_NOUN</entityType>
  </token>
</tokens>

the output is:

<word>Newt</word>
<word>Garry</word>
==============
  <word>Gingrich</word>
<word>Trudeau</word>

and the combining (zipping) of the two results (which you will specify in your favorite PL) is:

["Newt", "Gingrich"]

and

["Garry", "Trudeau"]

When the same transformation is applied to this XML document (note that we now have one triple):

<tokens>
  <token>
    <word>Newt</word><entityType>PROPER_NOUN</entityType>
  </token>
  <token>
    <word>Gingrich</word><entityType>PROPER_NOUN</entityType>
  </token>
  <token>
    <word>Rep</word><entityType>PROPER_NOUN</entityType>
  </token>
  <token>
    <word>admires</word><entityType>VERB</entityType>
  </token>
  <token>
    <word>Garry</word><entityType>PROPER_NOUN</entityType>
  </token>
  <token>
    <word>Trudeau</word><entityType>PROPER_NOUN</entityType>
  </token>
</tokens>

the result now is:

<word>Newt</word>
<word>Gingrich</word>
<word>Garry</word>
==============
  <word>Gingrich</word>
<word>Rep</word>
<word>Trudeau</word>

and zipping the two results produces the correct, wanted final result:

["Newt", "Gingrich"],

["Gingrich", "Rep"],

and

["Garry", "Trudeau"]
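
As a minimal sketch of the zipping step in Ruby: the two arrays below are hard-coded stand-ins for the text of the two result node-sets from the triple example. In practice they would come from evaluating the two XPath expressions (for instance with Nokogiri). The variable names are illustrative only.

```ruby
# Stand-ins for the text of the two XPath result node-sets (triple example).
firsts  = ["Newt", "Gingrich", "Garry"]    # "first-in-pair" words
seconds = ["Gingrich", "Rep", "Trudeau"]   # "second-in-pair" words

# Pair the kth element of one list with the kth element of the other.
pairs = firsts.zip(seconds)

p pairs
```

This relies on both node-sets being returned in document order, so that the kth entries belong to the same pair.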

Do Note:

The wanted result can be produced using a single XPath 2.0 expression. Do let me know if you are interested in an XPath 2.0 solution.

Upvotes: 1

Mark Thomas

Reputation: 37527

XPath returns a node or a nodeset, but doesn't return groups. So you have to identify the start of each group, then grab the rest.

first = "//token[entityType='PROPER_NOUN' and following-sibling::token[1][entityType='PROPER_NOUN']]/word"
# `next` is a reserved word in Ruby, so the second expression needs a different name
following = "../following-sibling::token[1]/word"

doc.xpath(first).map { |word| [word.text, word.xpath(following).text] }

Output:

[["Newt", "Gingrich"], ["Garry", "Trudeau"]]

Upvotes: 0

evil otto

Reputation: 10582

I'd do this in two steps. First step is to select a set of nodes:

//token[entityType='PROPER_NOUN' and following-sibling::token[1][entityType='PROPER_NOUN']]

This gives you all the tokens that start a 2-word pair. Then to get the actual pair, iterate over the node list and extract ./word and following-sibling::token[1]/word

Using XmlStarlet (http://xmlstar.sourceforge.net/, an awesome tool for quick XML manipulation), the command line is

xml sel -t -m "//token[entityType='PROPER_NOUN' and following-sibling::token[1][entityType='PROPER_NOUN']]" -v word -o "," -v "following-sibling::token[1]/word" -n /tmp/tok.xml

giving

Newt,Gingrich
Garry,Trudeau

XmlStarlet will also compile that command line to XSLT; the relevant bit is

  <xsl:for-each select="//token[entityType='PROPER_NOUN' and following-sibling::token[1][entityType='PROPER_NOUN']]">
    <xsl:value-of select="word"/>
    <xsl:value-of select="','"/>
    <xsl:value-of select="following-sibling::token[1]/word"/>
    <xsl:value-of select="'&#10;'"/>
  </xsl:for-each>

Using Nokogiri it could look something like:

#parse the document
doc = Nokogiri::XML(the_document_string)

#select all tokens that start 2-word pair
pair_starts = doc.xpath '//token[entityType = "PROPER_NOUN" and following-sibling::token[1][entityType = "PROPER_NOUN"]]'

#extract each word and the following one
result = pair_starts.each_with_object([]) do |node, array|
  array << [node.at_xpath('word').text, node.at_xpath('following-sibling::token[1]/word').text]
end

Upvotes: 1
