Stefano
Stefano

Reputation: 85

Count pairs of distinct elements using Xquery

I would count all distinct items considering parent and child tag.

For example, if the document contains:

    <root>
      <A value="5.1">
        <B value="3.5">
          <C value="1.4">
            <D value="0.2">
              <E value="X"/>
            </D>
          </C>
        </B>
      </A>
      <A value="5.1">
        <B value="3.5">
          <C value="1.4">
            <D value="0.4">
              <E value="Y"/>
            </D>
          </C>
        </B>
      </A>
      <A value="4.6">
        <B value="3.1">
          <C value="1.5">
            <D value="0.2">
              <E value="X"/>
            </D>
          </C>
        </B>
      </A>
      <A value="5.0">
        <B value="3.6">
          <C value="1.4">
            <D value="0.2">
              <E value="X"/>
            </D>
          </C>
        </B>
      </A>
    </root>

I would count all distinct items for AB, ABC, ABCD, BCD etc.

I tried this xquery command, but got only the distinct values ​​for individual items:

count(distinct-values(doc('partitioncollection')/root/A/@value)) -> 3
count(distinct-values(doc('partitioncollection')/root/B/@value)) -> 3
count(distinct-values(doc('partitioncollection')/root/C/@value)) -> 2

More specifically, if I want calculate the counts of all possibile pairs of value, the results should show:

A/B : 3
A/B/C : 3
A/B/C/D : 4
A/B/C/D/E : 4
B/C : 3
B/C/D : 4
B/C/D/E : 4
C/D : 3
C/D/E : 3
D/E : 2

C/E : should be 3 because there are 3 distinct pairs of values: 

<C value="1.4"><E value="X"/>
<C value="1.4"><E value="Y"/>
<C value="1.5"><E value="X"/>

A/D : should be 4 because there are 4 distinct pairs of values: 

<A value="5.1"><D value="0.2">
<A value="5.1"><D value="0.4">
<A value="4.6"><D value="0.2">
<A value="5.0"><D value="0.2">
etc.

Being computationally complex, I believe it is easier to create a function that takes as a single set of tags (i.e. A-D or C-E etc.) and returns the value of the counter

Upvotes: 0

Views: 409

Answers (1)

Martin Honnen
Martin Honnen

Reputation: 167716

I am still not sure the problem is precisely described (it is not clear whether the structure is always A/B/C/D/E, it is not clear whether for deeper levels you only want to compare e.g. B/C if they are children of an A with the same value) but in general I think this can be solve using grouping and recursion; I came up with

declare function local:distinct-descendants($elements as element()*) as xs:string*
{
    for $element-group in $elements[*]
    group by $element-name := node-name($element-group)
    let $max-depth := max($element-group/count(.//*)) + 1
    for $level in 2 to $max-depth
      let $groups := 
        for $path-group in $element-group
        group by $group-path := string-join(subsequence($path-group/descendant-or-self::*/node-name(), 1, $level), '/')
        for $value-group in $path-group
        group by $value-seq := string-join(subsequence($value-group/descendant-or-self::*/@value, 1, $level), '|')
        return head($group-path)
      for $level-group in $groups
      group by $level-path := $level-group
      order by head($level)
      return $level-path || ' : ' || count($level-group)
    ,
    if ($elements/*) then local:distinct-descendants($elements/*) else ()
};

local:distinct-descendants(root/*)

At https://xqueryfiddle.liberty-development.net/nbUY4kz/4 this outputs

A/B : 3
A/B/C : 3
A/B/C/D : 4
A/B/C/D/E : 4
B/C : 3
B/C/D : 4
B/C/D/E : 4
C/D : 3
C/D/E : 3
D/E : 2

I get the same result with BaseX.

The code might be a bit too complicated if the structure is always A/B/C/D/E, in that case the first grouping on the node-name() is not needed.

I also had to group twice inside, on the "path" sequence (e.g. A/B/C) and on the @value sequence, there might be easier ways but I wasn't able to relate unique @value sequences with the the nesting structure without doing the duplicate grouping.

Perhaps

declare function local:distinct-descendants($elements as element()*) as xs:string*
{
    let $max-depth := max($elements/count(descendant-or-self::*))
    for $level in 2 to $max-depth
      let $groups := 
        for $path-group in $elements
        group by 
          $group-path := string-join(subsequence($path-group/descendant-or-self::*/node-name(), 1, $level), '/'),
          $value-seq := string-join(subsequence($path-group/descendant-or-self::*/@value, 1, $level), '|')
        return head($group-path)
      for $level-group in $groups
      group by $level-path := $level-group
      order by head($level)
      return $level-path || ' : ' || count($level-group)
    ,
    if ($elements/*) then local:distinct-descendants($elements/*) else ()
};

local:distinct-descendants(root/*)

at https://xqueryfiddle.liberty-development.net/nbUY4kz/6 is a bit simpler and still does the job.

I think the previous two attempts work fine as long as the subtrees all have one child but break if not; thus in the general case of an arbitrary number of child elements it seems necessary to compute paths and values from descendants and group them:

declare variable $value-separator as xs:string external := '|';

declare function local:distinct-descendants($elements as element()*) as xs:string*
{
    for $element-group in $elements[*]
    group by $element-name := node-name($element-group)
    return
        (
            let $groups :=
                for $descendant-group in $element-group//*
                group by
                    $path-key := string-join(($descendant-group/ancestor-or-self::* except $element-group/ancestor::*)/node-name(), '/'),
                    $value-key := string-join(($descendant-group/ancestor-or-self::* except $element-group/ancestor::*)/@value, $value-separator)
                return $path-key
            for $path-group in $groups
            group by $path-key := $path-group
            return $path-key || ' : ' || count($path-group)
            ,
            if ($element-group/*) then local:distinct-descendants($element-group/*) else ()
        )
};

local:distinct-descendants(root/*)

https://xqueryfiddle.liberty-development.net/nbUY4kz/14

Upvotes: 2

Related Questions