Reputation: 35249
Here's an easy point for an XPath expert! :)
Document structure:
<tokens>
<token>
<word>Newt</word><entityType>PROPER_NOUN</entityType>
</token>
<token>
<word>Gingrich</word><entityType>PROPER_NOUN</entityType>
</token>
<token>
<word>admires</word><entityType>VERB</entityType>
</token>
<token>
<word>Garry</word><entityType>PROPER_NOUN</entityType>
</token>
<token>
<word>Trudeau</word><entityType>PROPER_NOUN</entityType>
</token>
</tokens>
Ignoring the semantic improbability of the document, I want to pull out [["Newt", "Gingrich"], ["Garry", "Trudeau"]], that is: when there are two tokens in a row whose entityTypes are PROPER_NOUN, I want to extract the words from those two tokens.
I've gotten as far as:
"//token[entityType='PROPER_NOUN']/following-sibling::token[1][entityType='PROPER_NOUN']"
... which gets as far as finding the second of two consecutive PROPER_NOUN tokens, but I'm not sure how to get it to emit the first token along with it.
Some notes:
Here's my solution using higher-level Ruby functions. But I'm tired of all those XPath bullies kicking sand in my face, and I'd like to know the way REAL XPath coders do it!
def extract(doc)
  names = []
  sentences = doc.xpath("//tokens")
  sentences.each do |sentence|
    tokens = sentence.xpath("token")
    prev = nil
    tokens.each do |token|
      name = token.xpath("word").text if token.xpath("entityType").text == "PROPER_NOUN"
      names << [prev, name] if (name && prev)
      prev = name
    end
  end
  names
end
Upvotes: 4
Views: 1822
Reputation: 163675
XPath alone isn't powerful enough for this task. But it's very easy in XSLT 2.0:
<xsl:for-each-group select="token" group-adjacent="entityType">
  <xsl:if test="current-grouping-key() = 'PROPER_NOUN'">
    <xsl:copy-of select="current-group()"/>
    <xsl:text>====</xsl:text>
  </xsl:if>
</xsl:for-each-group>
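For readers following along in Ruby (the asker's language), the group-adjacent idea can be sketched with Enumerable#chunk. This is only an illustration: the tokens array is a hypothetical stand-in for data already pulled out of the XML, not part of the answer. Note that grouping returns whole runs, so three PROPER_NOUN tokens in a row would come back as one group of three rather than as overlapping pairs.

```ruby
# Hypothetical input: [word, entityType] pairs in document order,
# assumed to have been extracted from the <token> elements already.
tokens = [
  ["Newt", "PROPER_NOUN"],
  ["Gingrich", "PROPER_NOUN"],
  ["admires", "VERB"],
  ["Garry", "PROPER_NOUN"],
  ["Trudeau", "PROPER_NOUN"]
]

# chunk groups consecutive elements for which the block returns the same
# value, which is roughly what group-adjacent does in XSLT 2.0.
groups = tokens
  .chunk { |_word, type| type }
  .select { |type, _run| type == "PROPER_NOUN" }
  .map { |_type, run| run.map(&:first) }

p groups
```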
Upvotes: 0
Reputation: 243619
This XPath 1.0 expression:
/*/token
[entityType='PROPER_NOUN'
and
following-sibling::token[1]/entityType = 'PROPER_NOUN'
]
/word
selects all "first-in-pair noun-words".
This XPath expression:
/*/token
[entityType='PROPER_NOUN'
and
preceding-sibling::token[1]/entityType = 'PROPER_NOUN'
]
/word
selects all "second-in-pair noun-words".
To produce the actual pairs, take the kth node from each of the two resulting node-sets: the kth "first" word and the kth "second" word belong to the same pair.
XSLT-based verification:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:template match="/">
<xsl:copy-of select=
"/*/token
[entityType='PROPER_NOUN'
and
following-sibling::token[1]/entityType = 'PROPER_NOUN'
]
/word
"/>
==============
<xsl:copy-of select=
"/*/token
[entityType='PROPER_NOUN'
and
preceding-sibling::token[1]/entityType = 'PROPER_NOUN'
]
/word
"/>
</xsl:template>
</xsl:stylesheet>
This transformation simply evaluates the two XPath expressions and outputs their results, with a delimiter marking the end of the first result and the start of the second.
When applied on the provided XML document:
<tokens>
<token>
<word>Newt</word><entityType>PROPER_NOUN</entityType>
</token>
<token>
<word>Gingrich</word><entityType>PROPER_NOUN</entityType>
</token>
<token>
<word>admires</word><entityType>VERB</entityType>
</token>
<token>
<word>Garry</word><entityType>PROPER_NOUN</entityType>
</token>
<token>
<word>Trudeau</word><entityType>PROPER_NOUN</entityType>
</token>
</tokens>
the output is:
<word>Newt</word>
<word>Garry</word>
==============
<word>Gingrich</word>
<word>Trudeau</word>
Combining (zipping) the two results, which you would do in your favorite programming language, gives:
["Newt", "Gingrich"]
and
["Garry", "Trudeau"]
When the same transformation is applied on this XML document (note we now have one triple):
<tokens>
<token>
<word>Newt</word><entityType>PROPER_NOUN</entityType>
</token>
<token>
<word>Gingrich</word><entityType>PROPER_NOUN</entityType>
</token>
<token>
<word>Rep</word><entityType>PROPER_NOUN</entityType>
</token>
<token>
<word>admires</word><entityType>VERB</entityType>
</token>
<token>
<word>Garry</word><entityType>PROPER_NOUN</entityType>
</token>
<token>
<word>Trudeau</word><entityType>PROPER_NOUN</entityType>
</token>
</tokens>
the result now is:
<word>Newt</word>
<word>Gingrich</word>
<word>Garry</word>
==============
<word>Gingrich</word>
<word>Rep</word>
<word>Trudeau</word>
and zipping the two results produces the correct, wanted final result:
["Newt", "Gingrich"],
["Gingrich", "Rep"],
and
["Garry", "Trudeau"]
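The zipping step is left to the host language. A minimal Ruby sketch, assuming the text of the two node-sets has already been collected into plain arrays (with Nokogiri that would be something like doc.xpath(expr).map(&:text)). The helper name zip_pairs is hypothetical.

```ruby
# zip_pairs is a hypothetical helper: it pairs the kth "first" word with the
# kth "second" word. Both lists are in document order and have one entry per
# adjacent PROPER_NOUN pair, so they should always be the same length.
def zip_pairs(firsts, seconds)
  raise ArgumentError, "result lists differ in length" unless firsts.length == seconds.length
  firsts.zip(seconds)
end

# Word texts taken from the two outputs shown above for the "triple" document.
pairs = zip_pairs(%w[Newt Gingrich Garry], %w[Gingrich Rep Trudeau])
p pairs
```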
Note:
The wanted result can be produced using a single XPath 2.0 expression. Do let me know if you are interested in an XPath 2.0 solution.
Upvotes: 1
Reputation: 37527
XPath returns a node or a nodeset, but doesn't return groups. So you have to identify the start of each group, then grab the rest.
first = "//token[entityType='PROPER_NOUN' and following-sibling::token[1][entityType='PROPER_NOUN']]/word"
following = "../following-sibling::token[1]/word"
doc.xpath(first).map { |word| [word.text, word.xpath(following).text] }
Output:
[["Newt", "Gingrich"], ["Garry", "Trudeau"]]
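The "find the start, then grab the next sibling" logic can also be checked without an XML parser. The sketch below is an assumption-laden illustration, not part of the answer: tokens is a hypothetical array of [word, entityType] pairs in document order, and each_cons(2) plays the role of following-sibling::token[1].

```ruby
# Hypothetical input mirroring a document with three PROPER_NOUN tokens in a row.
tokens = [
  ["Newt", "PROPER_NOUN"],
  ["Gingrich", "PROPER_NOUN"],
  ["Rep", "PROPER_NOUN"],
  ["admires", "VERB"],
  ["Garry", "PROPER_NOUN"],
  ["Trudeau", "PROPER_NOUN"]
]

# each_cons(2) yields every token together with its immediate successor;
# keep the windows where both are PROPER_NOUN and emit just the words.
pairs = tokens
  .each_cons(2)
  .select { |a, b| a[1] == "PROPER_NOUN" && b[1] == "PROPER_NOUN" }
  .map { |a, b| [a[0], b[0]] }

p pairs
```

A run of three proper nouns yields two overlapping pairs here, just as the two-step XPath approach does.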
Upvotes: 0
Reputation: 10582
I'd do this in two steps. First step is to select a set of nodes:
//token[entityType='PROPER_NOUN' and following-sibling::token[1][entityType='PROPER_NOUN']]
This gives you all the tokens that start a 2-word pair. Then to get the actual pair, iterate over the node list and extract ./word and following-sibling::token[1]/word
Using XmlStarlet (http://xmlstar.sourceforge.net/, an awesome tool for quick XML manipulation), the command line is
xml sel -t -m "//token[entityType='PROPER_NOUN' and following-sibling::token[1][entityType='PROPER_NOUN']]" -v word -o "," -v "following-sibling::token[1]/word" -n /tmp/tok.xml
giving
Newt,Gingrich
Garry,Trudeau
XmlStarlet will also compile that command line to XSLT; the relevant bit is
<xsl:for-each select="//token[entityType='PROPER_NOUN' and following-sibling::token[1][entityType='PROPER_NOUN']]">
  <xsl:value-of select="word"/>
  <xsl:value-of select="','"/>
  <xsl:value-of select="following-sibling::token[1]/word"/>
  <xsl:value-of select="'&#10;'"/>
</xsl:for-each>
Using Nokogiri it could look something like:
require 'nokogiri'

# parse the document
doc = Nokogiri::XML(the_document_string)
# select all tokens that start a 2-word pair
pair_starts = doc.xpath '//token[entityType = "PROPER_NOUN" and following-sibling::token[1][entityType = "PROPER_NOUN"]]'
# extract each word and the following one
result = pair_starts.each_with_object([]) do |node, array|
  array << [node.at_xpath('word').text, node.at_xpath('following-sibling::token[1]/word').text]
end
Upvotes: 1