Barry NL

Reputation: 963

Why does this federated SPARQL query work in TopBraid but not in Apache Fuseki?

I have the following federated SPARQL query that works as I expect in TopBraid Composer Free Edition (version 5.1.4) but does not work in Apache Fuseki (version 2.3.1):

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX movie: <http://data.linkedmdb.org/resource/movie/>
PREFIX dcterms: <http://purl.org/dc/terms/>

SELECT ?s WHERE {
    SERVICE <http://data.linkedmdb.org/sparql> {
        <http://data.linkedmdb.org/resource/film/1> movie:actor ?actor .
        ?actor movie:actor_name ?actorName .
    }
    SERVICE <http://dbpedia.org/sparql?timeout=30000> {
        ?s ?p ?o .
        FILTER(regex(str(?s), replace(?actorName, " ", "_"))) .
    }
}

I monitored the SPARQL subqueries that are executed under the hood and noticed that TopBraid sends the following query to the http://dbpedia.org/sparql endpoint:

SELECT  *
WHERE
  { ?s ?p ?o
    FILTER regex(str(?s), replace("Paul Reubens", " ", "_"))
  }

while Apache Fuseki executes the following sub query:

SELECT  *
WHERE
  { ?s  ?p  ?o
    FILTER regex(str(?s), replace(?actorName, " ", "_"))
  }

Notice the difference: TopBraid replaces the variable ?actorName with a particular value, "Paul Reubens", while Apache Fuseki does not. This results in an error from the http://dbpedia.org/sparql endpoint because ?actorName is used in the result set but not assigned.

Is this a bug in Apache Fuseki or a feature in TopBraid? How can I make Apache Fuseki execute this federated query correctly?

update 1: to clarify the difference in behaviour between TopBraid and Apache Fuseki a bit more: TopBraid executes the linkedmdb.org subquery first and then executes the dbpedia.org subquery once for each result of the linkedmdb.org query (substituting ?actorName with the values returned by linkedmdb.org). I assumed Apache Fuseki behaved similarly, but its first subquery to dbpedia.org fails (because ?actorName is used in the result set but not assigned), so it does not continue. Now I am not sure whether it would actually execute the dbpedia.org subquery multiple times, because it never gets that far.

update 2: I think both TopBraid and Apache Fuseki use Jena/ARQ, but I noticed that in stack traces from TopBraid the package name is something like com.topbraid.jena.* which might indicate they use a modified version of Jena/ARQ?

update 3: Joshua Taylor says below: "Surely you wouldn't expect the second service block to be executed for each one of them?". Both TopBraid and Apache Fuseki use exactly this method for the following query:

PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX movie: <http://data.linkedmdb.org/resource/movie/>
PREFIX dcterms: <http://purl.org/dc/terms/>

SELECT ?film ?label ?subject WHERE {
    SERVICE <http://data.linkedmdb.org/sparql> {
        ?film a movie:film .
        ?film rdfs:label ?label .
        ?film owl:sameAs ?dbpediaLink 
        FILTER(regex(str(?dbpediaLink), "dbpedia", "i"))
    }
    SERVICE <http://dbpedia.org/sparql> {
        ?dbpediaLink dcterms:subject ?subject
    }
}
LIMIT 50

I agree that in principle they should execute both parts once and join the results, but maybe they chose a different strategy for performance reasons?

Additionally, notice that the above query works on Apache Fuseki, while the first query of this post does not. So Apache Fuseki actually behaves like TopBraid in this particular case. The difference seems to be between using a URI variable (?dbpediaLink) in two triple patterns (which works in Fuseki) and using a string variable (?actorName) from a triple pattern inside a FILTER regex function (which does not work in Fuseki).

Upvotes: 2

Views: 701

Answers (2)

AndyS

Reputation: 16680

(long comment)

Consider:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX movie: <http://data.linkedmdb.org/resource/movie/>
PREFIX dcterms: <http://purl.org/dc/terms/>

SELECT ?s WHERE {
    {
        <http://data.linkedmdb.org/resource/film/1> movie:actor ?actor .
        ?actor movie:actor_name ?actorName .
    }
    {
        ?s ?p ?o .
        FILTER(regex(str(?s), replace(?actorName, " ", "_"))) .
    }
}

That is the same query but with no SERVICE calls. ?actorName does not occur in any pattern of the second inner {}, so nothing binds it there.

Because join is a commutative operation, the following query has the same answers as the first:

SELECT ?s WHERE {
    {
        ?s ?p ?o .
        FILTER(regex(str(?s), replace(?actorName, " ", "_"))) .
    }
    {
        <http://data.linkedmdb.org/resource/film/1> movie:actor ?actor .
        ?actor movie:actor_name ?actorName .
    }
}

The SERVICE version highlights this because the parts are executed separately on different machines.

The join of the two parts happens on the results of each part.
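
A minimal Python sketch of that point, using made-up solution sets rather than real endpoint results: each part is evaluated on its own, and the join only combines the two result sets afterwards, so the order of the parts does not matter. The helper names and the sample rows are assumptions for illustration, not Jena/ARQ code.

```python
def compatible(a, b):
    """Two solutions are compatible if every shared variable has the same value."""
    return all(a[var] == b[var] for var in a.keys() & b.keys())


def join(left, right):
    """Join two lists of solutions (dicts mapping variable name -> value)."""
    return [{**l, **r} for l in left for r in right if compatible(l, r)]


def as_set(solutions):
    """Order-insensitive view of a result set, for comparison."""
    return {frozenset(s.items()) for s in solutions}


# Hypothetical results of two parts, each evaluated independently.
part_a = [{"actor": "A1", "actorName": "Actor One"},
          {"actor": "A2", "actorName": "Actor Two"}]
part_b = [{"actor": "A1", "page": "P1"},
          {"actor": "A3", "page": "P3"}]

a_then_b = join(part_a, part_b)
b_then_a = join(part_b, part_a)

# Same solutions regardless of which part is written first.
print(as_set(a_then_b) == as_set(b_then_a))
```

Only the row for "A1" survives in either order, because it is the only value of the shared variable present on both sides.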

Upvotes: 2

Joshua Taylor

Reputation: 85883

Updated (Simpler) Response

In the original answer I wrote (below), I said that the issue was that SPARQL queries are executed innermost first. I think that that still applies here, but I think the problem can be isolated even more easily. If you have

service <ex1> { ... }
service <ex2> { ... }

then the results have to be what you'd get from executing each query separately on the endpoints and then joining the results. The join will merge any results where the common variables have the same values. E.g.,

service <ex1> { values ?a { 1 2 3 } }
service <ex2> { values ?a { 2 3 4 } }

would execute, and you'd have two possible values for ?a in the outer query (2 and 3). In your query, the second service can't produce any results. If you take:

?s ?p ?o .
FILTER(regex(str(?s), replace(?actorName, " ", "_"))) .

and execute it at DBpedia, you shouldn't get any results, because ?actorName isn't bound, so the filter will never succeed. It appears that TopBraid is performing the first service first and then injecting the resulting values into your second service. That's convenient, but I don't think it's correct, because it returns different results than what you'd get if the DBpedia query had been executed first and the other query executed second.
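
To make both points concrete, here is a rough Python model (not how any real engine is implemented). The first part reproduces the values example; the second mimics FILTER evaluation, where an expression over an unbound variable is an error and an error drops the solution. The row data and helper names are assumptions for illustration.

```python
import re


def join_on(var, left, right):
    """Keep the values of `var` that appear in both single-variable result sets."""
    return sorted({row[var] for row in left} & {row[var] for row in right})


ex1 = [{"a": 1}, {"a": 2}, {"a": 3}]
ex2 = [{"a": 2}, {"a": 3}, {"a": 4}]
print(join_on("a", ex1, ex2))  # [2, 3]


def filter_keeps(solution):
    """Model of FILTER(regex(str(?s), replace(?actorName, " ", "_")))."""
    if "actorName" not in solution:
        # Unbound variable: the expression is an error, so the solution is dropped.
        return False
    pattern = solution["actorName"].replace(" ", "_")
    return re.search(pattern, solution["s"]) is not None


# Hypothetical rows matched by `?s ?p ?o` at the remote endpoint. Nothing
# inside the second service block ever binds ?actorName.
remote_rows = [{"s": "http://dbpedia.org/resource/Paul_Reubens"},
               {"s": "http://dbpedia.org/resource/Berlin"}]

kept = [row for row in remote_rows if filter_keeps(row)]
print(kept)  # []
```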

Original Answer

Subqueries in SPARQL are executed inner-most first. That means that a query like

select * {
  { select ?x { ?x a :Cat } }
  ?x foaf:name ?name
}

would first find all the cats, and would then find their names. "Candidate" values for ?x are determined first by the subquery, and then those values for ?x are made available to the outer query. Now, when there are two subqueries, e.g.,

select * {
  { select ?x { ?x a :Cat } }
  { select ?x ?name { ?x foaf:name ?name } }
}

the first subquery is going to find all the cats. The second subquery finds all the names of everything that has a name, and then in the outer query, the results are joined to get just the names of the cats. The values of ?x from the first subquery aren't available during the execution of the second subquery. (At least in principle, a query optimizer might be able to figure out that some things should be restricted.)

My understanding is that service blocks have the same kind of semantics. In your query, you have:

SERVICE <http://data.linkedmdb.org/sparql> {
    <http://data.linkedmdb.org/resource/film/1> movie:actor ?actor .
    ?actor movie:actor_name ?actorName .
}
SERVICE <http://dbpedia.org/sparql?timeout=30000> {
    ?s ?p ?o .
    FILTER(regex(str(?s), replace(?actorName, " ", "_"))) .
}

You say that tracing shows that TopBraid is executing

SELECT  *
WHERE
  { ?s ?p ?o
    FILTER regex(str(?s), replace("Paul Reubens", " ", "_"))
  }

If TopBraid already executed the first service block and got a unique solution, then that might be an acceptable optimization, but what if, for instance, the first query had returned multiple bindings for ?actorName? Surely you wouldn't expect the second service block to be executed for each one of them? Instead, the second service block is executed as written, and will return a result set that will be joined with the result set from the first.
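
The difference between the two strategies can be sketched in Python. Here `run_remote` is a hypothetical stand-in for the DBpedia endpoint, the data is made up, and plain substring matching replaces the regex for brevity. The substitution loop is only a guess at what the trace suggests, not TopBraid's actual implementation.

```python
calls = []  # every request sent to the stand-in endpoint

# Hypothetical subjects held by the remote endpoint.
REMOTE_SUBJECTS = ["http://dbpedia.org/resource/Actor_One",
                   "http://dbpedia.org/resource/Actor_Two",
                   "http://dbpedia.org/resource/Berlin"]


def run_remote(actor_name=None):
    """Evaluate the second service block; actor_name is None when ?actorName is unbound."""
    calls.append(actor_name)
    if actor_name is None:
        return []  # a filter over an unbound variable never succeeds
    needle = actor_name.replace(" ", "_")
    return [{"s": s, "actorName": actor_name} for s in REMOTE_SUBJECTS if needle in s]


# Hypothetical result of the first service block: two bindings for ?actorName.
first_block = [{"actorName": "Actor One"}, {"actorName": "Actor Two"}]

# Strategy 1: run the block once, as written; joining an empty result set
# with first_block gives an empty answer.
as_written = run_remote()

# Strategy 2: substitute each binding and call the endpoint once per binding.
substituted = [row for b in first_block for row in run_remote(b["actorName"])]

print(len(as_written), len(substituted), len(calls))  # 0 2 3
```

With substitution, the number of remote requests grows with the number of bindings from the first block, and the answers differ from those of the block executed as written.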

It probably "doesn't work" in Jena because the second query doesn't actually bind any variables, so it pretty much has to look at every triple in the data, which is obviously going to take a long time.

I think that you can get around this by nesting the service calls. If nested service calls are all launched by the "local" endpoint (i.e., nesting a service call doesn't ask a remote endpoint to make another remote query), then you might be able to do:

SERVICE <http://dbpedia.org/sparql?timeout=30000> {
    SERVICE <http://data.linkedmdb.org/sparql> {
      <http://data.linkedmdb.org/resource/film/1> movie:actor ?actor .
      ?actor movie:actor_name ?actorName .
    }
    ?s ?p ?o .
    FILTER(regex(str(?s), replace(?actorName, " ", "_"))) .
}

That might get you the kind of optimization that you want, but that still seems like it might not work unless DBpedia has some efficient ways of figuring out which triples to retrieve based on computing the replace. You're asking DBpedia to look at all its triples, and then to keep the ones where the string form of the subject matches a particular regular expression. It'd probably be better to construct that IRI manually in a subquery and then search for it. I.e.,

SERVICE <http://dbpedia.org/sparql?timeout=30000> {
  { select ?dbpediaActor {
      SERVICE <http://data.linkedmdb.org/sparql> {
        <http://data.linkedmdb.org/resource/film/1> movie:actor ?actor .
        ?actor movie:actor_name ?actorName .
      }
      # bind to a fresh variable: ?actor is already bound by the pattern above
      bind(iri(concat("http://dbpedia.org/resource/",
                      replace(?actorName, " ", "_")))
           as ?dbpediaActor)
    } }
  ?dbpediaActor ?p ?o
}
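
The IRI construction itself is easy to check outside SPARQL. Below is a small Python equivalent of the iri(concat(...)) expression, assuming DBpedia resource IRIs are the namespace http://dbpedia.org/resource/ followed by the name with spaces turned into underscores. The function name is made up, and names that need further escaping are not handled.

```python
DBPEDIA_RESOURCE = "http://dbpedia.org/resource/"  # note the trailing slash


def dbpedia_iri(actor_name):
    """Mirror of iri(concat(namespace, replace(?actorName, " ", "_")))."""
    return DBPEDIA_RESOURCE + actor_name.replace(" ", "_")


print(dbpedia_iri("Paul Reubens"))  # http://dbpedia.org/resource/Paul_Reubens
```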

Upvotes: 3
