Reputation: 1321
I have never dealt with XML databases (or queried XML in any complex way), so XQuery is all new to me. I have read the Datypic book, and I originally tried to parse the XML into a relational database, but the tags carry meaning and a relational schema adds more complexity.
I have some files containing transcripts and details about the words used. The structure is like this:
<text id="KBY">
<bncDoc xml:id="KBY">
<stext type="CONVRSN">
<u who="KBYPSUNK">
<w tag="UH" hw="hi" pos="INTERJ" sem="Z4" semo="|Z4|">Hi</w>
<w tag="YEX" hw="PUNC" pos="STOP" sem="" semo="|">!</w>
</u>
<u who="PS10L">
<w tag="VVGK" hw="going" pos="VERB" sem="T1:1:3" semo="|T1:1:3|">Gon</w>
<w tag="TO" hw="to" pos="PREP" sem="Z5" semo="|Z5|">na</w>
<w tag="RR21" hw="at" pos="ADV" sem="A13:7" semo="|A13:7;i1:2:1|">at</w>
<w tag="RR22" hw="least" pos="ADV" sem="A13:7" semo="|A13:7;i1:2:2|A13:7|">least</w>
<w tag="VVI" hw="stop" pos="VERB" sem="T2" semo="|T2d|S8d|M8|H4|A1:1:1|">stop</w>
<w tag="II" hw="at" pos="PREP" sem="Z5" semo="|Z5|">at</w>
<w tag="NP1" hw="gerald" pos="SUBST" sem="Z1" semo="|Z1m|">Gerald</w>
<w tag="GE" hw="'s" pos="UNC" sem="Z5" semo="|Z5|">'s</w>
<w tag="VHZ" hw="have" pos="VERB" sem="Z5" semo="|Z5|A9u|A2:2|S4|">has</w>
<w tag="XX" hw="not" pos="ADV" sem="Z6" semo="|Z6|">n't</w>
<w tag="PPHS1" hw="he" pos="PRON" sem="Z8" semo="|Z8m|">he</w>
<w tag="YQUE" hw="PUNC" pos="STOP" sem="" semo="|">?</w>
</u>
Trivially, I know I can query for single words using:
for $w in //w
where $w = "houses"
return $w
OR
for $w in //w//text()
where $w = "houses"
return $w
But I cannot, for the life of me, figure out how to query for more than a single term, for example "There were three houses". This would involve checking that the words are adjacent and that they all sit inside the same u tag rather than in separate ones. Ideally, I would be able to grab a few words before and after the match as well. So far I assume this is difficult because of the structure, but searching the plain file takes more than 6 seconds, and BaseX seems to be very efficient with this.
Any help is appreciated!
Upvotes: 1
Views: 58
Reputation: 163595
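With XQuery 1.0 you can do something like the following, evaluated with a u element as the context item so that w selects the words of that utterance: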
for $x at $p in w
where string-join(subsequence(w, $p, 4), ' ') = "There were three houses"
return ...
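To address the two remaining parts of the question (staying inside one u and grabbing some surrounding words), the expression above could be wrapped roughly as follows. This is only a sketch under some assumptions: the variable names, the hit element, and the context size of 3 are made up for illustration, and it assumes the text of each w carries no extra whitespace, as in the sample.

```xquery
(: Sketch: find a phrase whose words are adjacent within a single u,
   and return it with a few words of context on either side.
   $phrase, $context and the <hit> element are illustrative names. :)
let $phrase  := ("There", "were", "three", "houses")
let $target  := string-join($phrase, ' ')
let $n       := count($phrase)
let $context := 3
for $u in //u
let $words := $u/w            (: only the words of this utterance :)
for $w at $p in $words
where string-join(subsequence($words, $p, $n), ' ') = $target
return
  <hit who="{ $u/@who }">{
    (: subsequence() ignores positions before 1 or past the end,
       so matches near the edges of the utterance simply get less context :)
    string-join(subsequence($words, $p - $context, $n + 2 * $context), ' ')
  }</hit>
```

Because $words is rebuilt for every u, a match can never span two utterances. Comparing against the hw attribute instead of the element text would be one way to match headwords rather than surface forms.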
With XQuery 3.0 (or 3.1) you can use the new "for sliding window" clause, but I don't think it makes the answer any simpler than the above.
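For comparison, a sliding-window version might look like the sketch below. It follows the windowing pattern from the XQuery 3.0 specification; the four-word window size and the hit element are assumptions chosen to match the example phrase.

```xquery
(: Sketch: one window of exactly four consecutive w elements
   starting at every position within each u. :)
for $u in //u
for sliding window $win in $u/w
  start at $s when true()
  only end at $e when $e - $s eq 3   (: 4 words per window :)
where string-join($win, ' ') = "There were three houses"
return <hit who="{ $u/@who }">{ string-join($win, ' ') }</hit>
```

The "only end" form discards windows that never reach four words, so the short tails at the end of an utterance are skipped. This version does not collect the surrounding context words.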
Upvotes: 0