qqq

Reputation: 1430

CONSTRUCT a DISTINCT set of triples by following paths in SPARQL

I am trying to write a SPARQL query that extracts all relevant triples from a triplestore using CONSTRUCT. Essentially, the triplestore contains a bunch of JSON-LD documents that were parsed into triples, so there is a predictable set of predicates and patterns, and my goal is to reconstruct one of these documents by retrieving its triples. The documents were JSON objects roughly 7 nested objects deep. The structure is generally known, but any leaf object may have unknown properties that I want to get back. One way I can go about this is:

CONSTRUCT WHERE
{
  # get top level object
  ?subject <:knownProperty1> ?v1 .
  ?subject <:knownProperty2> ?v2 .
  ?subject <:knownProperty3> ?v3 .

  # leaf subobjects should get all their fields included
  ?v1 ?v1_p ?v1_o .
  ?v2 ?v2_p ?v2_o .
  ?v3 ?v3_p ?v3_o .

  # v3 has these nested objects.
  ?v3 <:knownNest1> ?n1 .
  ?n1 ?n1_p ?n1_o .

  # n2 is the next level of nesting
  ?n1 <:knownNest2> ?n2 .
  ?n2 ?n2_p ?n2_o .

  #... and so on
}

This produces a result that is orders of magnitude larger than the actual document because of duplication. It is correct, but it emits "a graph" for every combinatorial match of these values, especially because each level of nesting may have multiple subobjects (an array). It gets hairier because many of these known fields are also optional. For example, every match that includes ?subject <:knownProperty1> <:value1> supplies one copy of that triple (each match assigns one concrete value per variable), so the triple ends up being included hundreds to thousands of times. In the simple test case I am using to iterate, there are 106 triples in the input, and fully specifying the allowed structure as shown above yields a CONSTRUCT result of 5.5 MILLION triples with a query latency (in RAM) of over 60 seconds.

I can handle writing a complex query, but I believe this is a code smell given that the basic problem is not that complicated. So my questions are:

  1. Am I thinking about this wrong? Is it in fact quite hard in SPARQL to write a query that retrieves all the triples along certain paths?
  2. Is there a convenient way to use SELECT DISTINCT subqueries to shorten this? All my attempts are equivalent to "select each distinct comprehensive match on this pattern", which is no better. I want distinct triples once the pattern matches are combined.

Or any other suggestions about the proper way to approach this. Thank you!

Upvotes: 4

Views: 244

Answers (2)

Benjamin Hofstetter

Reputation: 132

I use the following pattern and process to write CONSTRUCT queries like this.

  1. Start with a SELECT and a UNION for each level.
    • To be safe, don't reuse variable names in different UNION blocks.
SELECT * WHERE
{ 
  {
        # get top level object

  } UNION {
        # leaf subobjects should get all their fields included

  } UNION {
        # v3 has these nested objects.

  } UNION {
        # n2 is the next level of nesting

  } UNION {
        #... and so on

  }
}

Now you can run the query and verify the output. If everything is OK, replace the 'SELECT *' with your CONSTRUCT template. Imagine that your CONSTRUCT template gets called for every row of the table returned by your SELECT query.

CONSTRUCT {
  # get top level object triples
  .... use the variables from UNION block 1 

  # leaf subobjects should get all their fields included
  .... use the variables from UNION block 2
 
  # v3 has these nested objects.
  .... use the variables from UNION block 3

  # n2 is the next level of nesting  
  .... use the variables from UNION block 4

  # and so on ....

}
WHERE
{ 
  {
        # get top level object

  } UNION {
        # leaf subobjects should get all their fields included

  } UNION {
        # v3 has these nested objects.

  } UNION {
        # n2 is the next level of nesting

  } UNION {
        #... and so on

  }
}
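As a rough sketch of how the skeleton above might be filled in for the structure from the question: the `:` prefix and its namespace are assumptions standing in for the question's `<:knownProperty1>` style placeholders, and the property paths require SPARQL 1.1. Each solution row binds the variables of only one UNION block, and a CONSTRUCT template triple that contains an unbound variable is skipped for that row, so the number of rows grows with the sum of the per-block matches rather than their product.

```sparql
# Hypothetical namespace; replace with the IRIs actually used in the data.
PREFIX : <http://example.org/>

CONSTRUCT {
  # block 1: link to the first leaf object plus all of its fields
  ?s1 :knownProperty1 ?v1 .
  ?v1 ?v1_p ?v1_o .

  # block 2: same for the second leaf object
  ?s2 :knownProperty2 ?v2 .
  ?v2 ?v2_p ?v2_o .

  # block 3: the generic pattern also captures ?v3 :knownNest1 ?n1
  ?s3 :knownProperty3 ?v3 .
  ?v3 ?v3_p ?v3_o .

  # block 4: all fields of the first nesting level
  ?n1 ?n1_p ?n1_o .

  # block 5: all fields of the second nesting level
  ?n2 ?n2_p ?n2_o .
}
WHERE {
  {
    ?s1 :knownProperty1 ?v1 .
    ?v1 ?v1_p ?v1_o .
  } UNION {
    ?s2 :knownProperty2 ?v2 .
    ?v2 ?v2_p ?v2_o .
  } UNION {
    ?s3 :knownProperty3 ?v3 .
    ?v3 ?v3_p ?v3_o .
  } UNION {
    # property path: walk from the top-level object down to n1
    ?s4 :knownProperty3/:knownNest1 ?n1 .
    ?n1 ?n1_p ?n1_o .
  } UNION {
    # property path: walk one level further down to n2
    ?s5 :knownProperty3/:knownNest1/:knownNest2 ?n2 .
    ?n2 ?n2_p ?n2_o .
  }
}
```

Note that this version no longer requires all three known properties to be present on the same subject, which may be what you want if those fields are optional. To restrict the result to a single document, each block would need an extra pattern (or a VALUES clause) that pins the top-level subject.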

This approach works well for me.

  1. I avoid unwanted 'duplicates' because I don't mix variables from different UNION blocks.
  2. By starting with a SELECT, I can first focus on fetching the needed triples, and then in the CONSTRUCT part focus on building the graph.
  3. Separating different concerns into different UNION blocks helps me debug.
  4. I can optimise the query performance of a single UNION block if needed.

Cons: This approach sometimes leads to repeating yourself across the UNION blocks.

Upvotes: 2

AndyS

Reputation: 16700

Whether there are duplicate triples (as opposed to unexpected triples generated by the pattern that are not in the data graph) depends on the triplestore. It is a trade-off between returning a set (one occurrence of each triple) and scalability for large results (keeping the set of triples in order to deduplicate it prevents streaming the results).

A more elaborate CONSTRUCT can avoid the duplicates by controlling the pattern:

CONSTRUCT { ... }
WHERE {
   SELECT DISTINCT vars needed by template {
      ...
   }
}

That loses the ability to use CONSTRUCT WHERE, so the triple patterns are written twice (once in the template and once in the WHERE clause). CONSTRUCT WHERE is only a convenient short form.
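A minimal sketch of that shape, using property names from the question; the `:` prefix and its namespace are assumptions. The subquery projects only the variables the template uses, so solution rows that differ only in `?v2` collapse into one row before the template is instantiated.

```sparql
# Hypothetical namespace; replace with the IRIs actually used in the data.
PREFIX : <http://example.org/>

CONSTRUCT {
  ?s :knownProperty1 ?v1 .
  ?v1 ?v1_p ?v1_o .
}
WHERE {
  # Only the variables needed by the template are projected.
  SELECT DISTINCT ?s ?v1 ?v1_p ?v1_o
  WHERE {
    ?s :knownProperty1 ?v1 .
    ?s :knownProperty2 ?v2 .   # required to exist, but not projected
    ?v1 ?v1_p ?v1_o .
  }
}
```

The DISTINCT reduces the rows handed to the template; whether it also reduces the work done inside the subquery depends on the engine's optimizer.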

The full CONSTRUCT form can also have OPTIONAL in the pattern part.
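For illustration, a small sketch with an optional nested object, again assuming a hypothetical `:` namespace in place of the question's placeholders. When the OPTIONAL part does not match, `?n1`, `?n1_p` and `?n1_o` stay unbound and the template triples that mention them are simply left out for that row.

```sparql
# Hypothetical namespace; replace with the IRIs actually used in the data.
PREFIX : <http://example.org/>

CONSTRUCT {
  ?s :knownProperty3 ?v3 .
  ?v3 :knownNest1 ?n1 .
  ?n1 ?n1_p ?n1_o .
}
WHERE {
  ?s :knownProperty3 ?v3 .
  # The nested object may be absent; the top-level triple is still emitted.
  OPTIONAL {
    ?v3 :knownNest1 ?n1 .
    ?n1 ?n1_p ?n1_o .
  }
}
```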

Upvotes: 1
