Christopher Gwilliams

Reputation: 1321

XQuery - Query neighbouring tags

I have never dealt with XML databases (or queried XML in any complex way), so XQuery is all new to me. I have read the Datypic book, and I originally tried to load the XML into a relational database, but the tags carry meaning and a relational schema adds more complexity.

I have some files containing transcripts and details about the words used. The structure is like this:

<text id="KBY">
  <bncDoc xml:id="KBY">
    <stext type="CONVRSN">
      <u who="KBYPSUNK">
        <w tag="UH" hw="hi" pos="INTERJ" sem="Z4" semo="|Z4|">Hi</w>
        <w tag="YEX" hw="PUNC" pos="STOP" sem="" semo="|">!</w>
      </u>
      <u who="PS10L">
        <w tag="VVGK" hw="going" pos="VERB" sem="T1:1:3" semo="|T1:1:3|">Gon</w>
        <w tag="TO" hw="to" pos="PREP" sem="Z5" semo="|Z5|">na</w>
        <w tag="RR21" hw="at" pos="ADV" sem="A13:7" semo="|A13:7;i1:2:1|">at</w>
        <w tag="RR22" hw="least" pos="ADV" sem="A13:7" semo="|A13:7;i1:2:2|A13:7|">least</w>
        <w tag="VVI" hw="stop" pos="VERB" sem="T2" semo="|T2d|S8d|M8|H4|A1:1:1|">stop</w>
        <w tag="II" hw="at" pos="PREP" sem="Z5" semo="|Z5|">at</w>
        <w tag="NP1" hw="gerald" pos="SUBST" sem="Z1" semo="|Z1m|">Gerald</w>
        <w tag="GE" hw="'s" pos="UNC" sem="Z5" semo="|Z5|">'s</w>
        <w tag="VHZ" hw="have" pos="VERB" sem="Z5" semo="|Z5|A9u|A2:2|S4|">has</w>
        <w tag="XX" hw="not" pos="ADV" sem="Z6" semo="|Z6|">n't</w>
        <w tag="PPHS1" hw="he" pos="PRON" sem="Z8" semo="|Z8m|">he</w>
        <w tag="YQUE" hw="PUNC" pos="STOP" sem="" semo="|">?</w>
      </u>

Trivially, I know I can query for single words using:

for $w in //w
where $w = "houses"
return $w

OR

for $w in //w//text()
where $w = "houses"
return $w

But I cannot, for the life of me, figure out how to query for more than a single term, e.g. "There were three houses". This would involve checking that the words are adjacent and are not in separate u tags. Ideally, I would also be able to grab a few words before and after the match. I assume this is difficult because of the structure, but searching the plain file takes more than 6 seconds and BaseX seems to be very efficient at this.

Any help is appreciated!

Upvotes: 1

Views: 58

Answers (1)

Michael Kay

Reputation: 163595

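With XQuery 1.0 you can do something like the following, evaluated with a u element as the context item (so that the relative path w selects that utterance's words):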

for $x at $p in w 
where string-join(subsequence(w, $p, 4), ' ') = "There were three houses"
return ...
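
A fuller sketch of the same idea, in case it helps. It assumes the w elements are direct children of u, as in the sample, and that the search phrase is tokenised the same way as the corpus (the sample splits "Gonna" into "Gon" and "na", so contractions need care). The names $phrase, $n and $ctx and the hit wrapper element are made up for illustration, not anything BaseX or the corpus defines:

```xquery
(: Illustrative names only: $phrase, $n, $ctx and <hit> are assumptions. :)
declare variable $phrase := "There were three houses";
declare variable $n := count(tokenize($phrase, ' '));  (: words in the phrase :)
declare variable $ctx := 3;                            (: context words on each side :)

for $u in //u
let $ws := $u/w            (: only this utterance's words, so a match never spans two u elements :)
for $w at $p in $ws
where string-join(subsequence($ws, $p, $n), ' ') = $phrase
return
  <hit who="{$u/@who}">{
    (: subsequence() clips positions below 1 or past the end,
       so no explicit bounds check is needed here :)
    string-join(subsequence($ws, $p - $ctx, $n + 2 * $ctx), ' ')
  }</hit>
```

Because $ws is taken per utterance, the "not in a separate u tag" condition comes for free, and the surrounding context is limited to the same utterance as well.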

With XQuery 3.0 (or 3.1) you can use the new "for sliding window" clause, but I don't think it makes the answer any simpler than the above.
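
For comparison, a minimal sketch of the sliding-window form, again assuming w is a direct child of u and a four-word phrase. It needs a processor that supports XQuery 3.0 window clauses:

```xquery
for $u in //u
for sliding window $win in $u/w
  start at $s when true()              (: open a window at every word :)
  only end at $e when $e - $s eq 3     (: keep only complete four-word windows :)
where string-join($win, ' ') = "There were three houses"
return $win
```

The "only end" keeps the shorter windows at the end of each utterance from being returned, but the window form does not give you the surrounding context words as directly as subsequence() with a position does.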

Upvotes: 0
