Reputation: 1430
I am trying to write a SPARQL query that extracts all relevant triples from a triplestore using CONSTRUCT. Essentially, the triplestore contains a bunch of JSON-LD documents that were parsed into triples, so there is a predictable set of predicates and patterns, and my goal is to reconstruct one of these documents by fetching its triples. The documents were JSON objects roughly 7 nested objects deep. The structure is generally known, but any leaf object may have unknown properties that I want to get back. One way I can go about this is:
CONSTRUCT WHERE
{
# get top level object
?subject <:knownProperty1> ?v1 .
?subject <:knownProperty2> ?v2 .
?subject <:knownProperty3> ?v3 .
# leaf subobjects should get all their fields included
?v1 ?v1_p ?v1_o .
?v2 ?v2_p ?v2_o .
?v3 ?v3_p ?v3_o .
# v3 has these nested objects.
?v3 <:knownNest1> ?n1 .
?n1 ?n1_p ?n1_o .
# n2 is the next level of nesting
?n1 <:knownNest2> ?n2 .
?n2 ?n2_p ?n2_o .
#... and so on
}
This produces a set of triples that is orders of magnitude larger than the actual document due to duplication. It is correct, but it creates "a graph" for every possible combinatorial match of these values, especially because each level of nesting may have multiple subobjects (an array). It gets hairier because many of these known fields are also optional. For example, every graph match (each assigning one concrete value per variable) that includes ?subject <:knownProperty1> <:value1> supplies one copy of that triple, so it ends up being included hundreds to thousands of times. In my simple test case that I am using to iterate on, there are 106 triples in the input, and fully specifying the allowed structure as shown above results in a CONSTRUCT result set of 5.5 million triples with a query latency (in RAM) of over 60 seconds.
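To make the blow-up concrete, here is a rough count for a single ?subject. It looks only at the first six patterns of the query above and assumes, purely for illustration, that ?v1, ?v2 and ?v3 each match exactly one node, with a, b and c outgoing triples respectively (a, b, c are made-up symbols, not measurements from my data):

```latex
\begin{align*}
\text{solutions}               &= a \cdot b \cdot c \\
\text{template instantiations} &= 6 \cdot a \cdot b \cdot c \\
\text{distinct triples}        &= 3 + a + b + c
\end{align*}
```

With a = b = c = 10 that is 1000 solutions and 6000 instantiated triples for only 33 distinct ones, and every extra nesting level or array multiplies the first two numbers again.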
I can handle writing a complex query but I believe this is a code smell given that the basic problem is not that complicated. So my question is:
or any other suggestions about the proper way to try this. Thank you!
Upvotes: 4
Views: 244
Reputation: 132
I use the following pattern and process to write CONSTRUCT queries like this. Start with a SELECT whose WHERE clause has one UNION block per part of the document:
SELECT * WHERE
{
{
# get top level object
} UNION {
# leaf subobjects should get all their fields included
} UNION {
# v3 has these nested objects.
} UNION {
# n2 is the next level of nesting
} UNION {
#... and so on
}
}
Now you can run the query and verify the output. If all is OK, replace the 'SELECT *' with your CONSTRUCT template. Imagine that the CONSTRUCT template gets instantiated once for every row of the result table from your SELECT query.
CONSTRUCT {
# get top level object triples
.... use the variables from UNION block 1
# leaf subobjects should get all their fields included
.... use the variables from UNION block 2
# v3 has these nested objects.
.... use the variables from UNION block 3
# n2 is the next level of nesting
.... use the variables from UNION block 4
# and so on ....
}
WHERE
{
{
# get top level object
} UNION {
# leaf subobjects should get all their fields included
} UNION {
# v3 has these nested objects.
} UNION {
# n2 is the next level of nesting
} UNION {
#... and so on
}
}
This approach works well for me.
Cons: this approach sometimes means repeating yourself across the different UNION blocks.
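As a rough sketch, the skeleton might be filled in like this for the structure in the question. The predicates are copied from the question as written; the block contents and the variable names ?top_p, ?top_o, ?link, ?leaf, ?leaf_p and ?leaf_o are my own assumptions, so adjust them to your data.

```sparql
CONSTRUCT {
  # block 1: top-level object triples
  ?subject ?top_p ?top_o .
  # block 2: all fields of the directly linked subobjects
  ?leaf ?leaf_p ?leaf_o .
  # block 3: all fields of the objects nested under v3
  ?n1 ?n1_p ?n1_o .
  # block 4: all fields of the next nesting level
  ?n2 ?n2_p ?n2_o .
}
WHERE {
  {
    # block 1: the known top-level properties
    VALUES ?top_p { <:knownProperty1> <:knownProperty2> <:knownProperty3> }
    ?subject ?top_p ?top_o .
  } UNION {
    # block 2: follow each known property and take every triple of its target
    VALUES ?link { <:knownProperty1> <:knownProperty2> <:knownProperty3> }
    ?subject ?link ?leaf .
    ?leaf ?leaf_p ?leaf_o .
  } UNION {
    # block 3: walk down to n1 and take every triple of it
    ?subject <:knownProperty3> ?v3 .
    ?v3 <:knownNest1> ?n1 .
    ?n1 ?n1_p ?n1_o .
  } UNION {
    # block 4: walk down to n2 and take every triple of it
    ?subject <:knownProperty3> ?v3 .
    ?v3 <:knownNest1> ?n1 .
    ?n1 <:knownNest2> ?n2 .
    ?n2 ?n2_p ?n2_o .
  }
}
```

Each row binds only the variables of its own block, and CONSTRUCT skips any template triple that still contains an unbound variable, so a row only emits the triples of its block. The number of rows is roughly the sum of the matches per block rather than their product. Note one difference from the CONSTRUCT WHERE version: the blocks match independently, so a document that lacks one of the known properties still returns the rest, which may be what you want given that many fields are optional.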
Upvotes: 2
Reputation: 16700
Whether there are duplicate triples (as opposed to unexpected triples generated by the pattern that are not in the data graph) depends on the triplestore. It is a trade-off between returning a set (one occurrence of each triple) and scalability for large results (keeping the set of triples already seen prevents streaming the results).
A more complicated CONSTRUCT can avoid the duplicates by controlling the pattern with a sub-select:
CONSTRUCT { ... }
WHERE {
SELECT DISTINCT vars needed by template {
...
}
}
That loses the ability to use CONSTRUCT WHERE, so the template is written twice. CONSTRUCT WHERE is only a convenient short form.
The full CONSTRUCT can have OPTIONAL in the pattern part.
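A minimal sketch of that shape, using the predicates from the question for one branch of the document. Which variables to project and where to put OPTIONAL are assumptions on my part, so treat this as an illustration rather than a drop-in query.

```sparql
CONSTRUCT {
  ?v3 <:knownNest1> ?n1 .
  ?n1 <:knownNest2> ?n2 .
}
WHERE {
  # project only the variables the template uses, so DISTINCT can
  # collapse rows that differ only in variables the template ignores
  SELECT DISTINCT ?v3 ?n1 ?n2 WHERE {
    ?subject <:knownProperty3> ?v3 .
    ?v3 <:knownNest1> ?n1 .
    # the second nesting level may be absent
    OPTIONAL { ?n1 <:knownNest2> ?n2 . }
  }
}
```

Here ?subject is matched but not projected, so rows that differ only in ?subject are merged before the template is instantiated. When ?n2 is unbound, the second template triple is simply left out of the result.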
Upvotes: 1