Why does this SPARQL query time out, and how can I optimize it?

Question

I have this SPARQL query that I ran against Wikidata's endpoint:

SELECT ?bLabel ?b ?hLabel ?a ?cLabel
  WHERE
  {
    wd:Q11462 ?a ?b.
    wd:Q11095 ?a ?b.
    ?c ?a ?b.
    ?h wikibase:directClaim ?a .
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
  }

Essentially, I'm looking for relationships that are shared by wd:Q11462 and wd:Q11095, and then for what else shares those relationships. The query hits the 60-second time limit.

However, if I split the work into two parts:

First, obtain the shared relationships:

SELECT ?bLabel ?b ?hLabel ?a
  WHERE
  {
    wd:Q11462 ?a ?b.
    wd:Q11095 ?a ?b.
    ?h wikibase:directClaim ?a .
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
  }

And then, for each relationship obtained, run a query that finds what else shares it with them:

SELECT ?cLabel
  WHERE
  {
    ?c wdt:P131 wd:Q3586.
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
  }

Altogether, these queries take only about 2.5 seconds to run.

Due to constraints, I would like to reach that same speed with a single query. What should I do?

Karl Amort · Accepted Answer

Here is an approach that uses a subquery. It takes six seconds:

SELECT ?cLabel
  WITH {
    SELECT ?bLabel ?b ?hLabel ?a
    WHERE {
      wd:Q11462 ?a ?b.
      wd:Q11095 ?a ?b.
      ?h wikibase:directClaim ?a .
    }
  } as %results
  WHERE {
    INCLUDE %results.
    ?c wdt:P131 wd:Q3586.
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
  }
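
The query above still hard-codes one relationship (wdt:P131 wd:Q3586). As an untested sketch, assuming the goal is the ?c ?a ?b pattern from the question, the same named-subquery shape could join on the shared variables instead. The name %shared is arbitrary, and labels are left out on purpose. Whether this stays under the timeout depends on how many items match:

SELECT ?a ?b ?c
  WITH {
    # Inner query: relationships shared by both items,
    # restricted to direct-claim (wdt:) predicates.
    SELECT ?a ?b
    WHERE {
      wd:Q11462 ?a ?b.
      wd:Q11095 ?a ?b.
      ?h wikibase:directClaim ?a .
    }
  } AS %shared
  WHERE {
    INCLUDE %shared.
    # Outer query: everything else that has the same statement.
    ?c ?a ?b.
  }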

Subqueries are the natural extension given the stark difference you observed and how close they are conceptually to your approach of running multiple queries consecutively. A more generic trick that often helps is replacing the label service with a manual query for labels.
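
As a rough illustration of the manual label lookup, assuming English labels are wanted and reusing the P131/Q3586 example from the question, the label service call can be swapped for an rdfs:label pattern. This is a sketch, not a measured improvement:

SELECT ?c ?cLabel
  WHERE
  {
    ?c wdt:P131 wd:Q3586.
    # Manual label lookup instead of SERVICE wikibase:label.
    # OPTIONAL keeps items that have no English label.
    OPTIONAL {
      ?c rdfs:label ?cLabel.
      FILTER(LANG(?cLabel) = "en")
    }
  }

Unlike the label service, this leaves ?cLabel unbound when there is no English label, rather than falling back to the item ID.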

After switching to some items with fewer (common) statements, I convinced the query service to explain itself. I can't quite claim to understand that output, but as far as I can tell it's the label service that's throwing it off (Row 5 in the table at the bottom):

9             com.bigdata.bop.BOp.bopId 
CONTROLLER    com.bigdata.bop.BOp.evaluationContext
false         com.bigdata.bop.PipelineOp.pipelined
true          com.bigdata.bop.PipelineOp.sharedState
ServiceNode   com.bigdata.bop.controller.ServiceCallJoin.serviceNode
wdq           com.bigdata.bop.controller.ServiceCallJoin.namespace
1596209250127 com.bigdata.bop.controller.ServiceCallJoin.timestamp
[b, h, c]     com.bigdata.bop.join.HashJoinAnnotations.joinVars
null          com.bigdata.bop.join.JoinAnnotations.constraints

It seems as if it tries to populate labels for 20000+ items at that point. Apart from just leaving the label service out of the first query, Blazegraph (the engine behind the query service) supports query hints that suggest an order of operations, which might be useful here.
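
A minimal sketch of such a hint, assuming Blazegraph's hint:Query hint:optimizer "None" setting, which turns off the query optimizer so that the patterns are joined in the order they are written. It is untested against these items and may or may not avoid the timeout. Labels are omitted here:

SELECT ?a ?b ?c
  WHERE
  {
    # Ask Blazegraph not to reorder the patterns below.
    hint:Query hint:optimizer "None".
    wd:Q11462 ?a ?b.
    wd:Q11095 ?a ?b.
    ?h wikibase:directClaim ?a .
    ?c ?a ?b.
  }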
