Reputation: 1816

SPARQL query running too slow, when querying, sometimes timesout, is there a way to optimize the query?

My Wikidata Query Service query sometimes takes 35000 milliseconds or 35 seconds to complete. I'm not very adept at SPARQL. The query below does work (except it does duplicates sometimes). I'm wanting to get "famous people" by providing a birthday date and month, of which I get their name, birthday, an image of them (Wikimedia), and occupations. I'm also filtering by persons with birth places in the USA and UK.

I've added a variable called "sitelinks" that I count how many links are dedicated to them as a popularity metric (I'm open to better ideas on this, if there is a better way to gauge popularity). Is there a way to make this more optimized? Again, the query works, it's just slow.

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?person ?personLabel ?birthdate ?countryLabel (COUNT(DISTINCT(?sitelink)) as ?sites) (GROUP_CONCAT(DISTINCT ?occupationLabel; separator=", ") as ?occupations) (SAMPLE(?image) as ?uniqueImage)
WHERE {
  ?person wdt:P31 wd:Q5 ;  # Instance of human
      wdt:P569 ?birthdate ;  # Date of birth
      wdt:P27 ?country ;  # Citizenship
      wdt:P106 ?occupation ;  # Occupation
      wdt:P18 ?uniqueImage .  # Image
     
  ?country rdfs:label ?countryLabel .
  ?occupation rdfs:label ?occupationLabel .
  ?sitelink schema:about ?person .
  FILTER(LANG(?countryLabel) = "en")
  FILTER(LANG(?occupationLabel) = "en")

  FILTER(MONTH(?birthdate) = 5 && DAY(?birthdate) = 20)
  FILTER(?country IN (wd:Q30, wd:Q145))

  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
GROUP BY ?person ?personLabel ?birthdate ?countryLabel ?uniqueImage ORDER BY DESC(?sites)
LIMIT 50

If anyone wants to paste the query to the Wikidata Query Service, here is the link https://query.wikidata.org/

Upvotes: 1

Answers (3)

Valerio Cocchi

Reputation: 1966

Interesting question!

It's true that it's unlikely that the query can be optimised much more than what logi-kal did. However you could look at the number of social media followers that a user has wdt:P8687 as a proxy for their popularity, however imperfect.

This would yield a query like this, which takes me 7-8s to run:

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT ?person ?personLabel ?birthdate ?countryLabel ?followers ?uniqueImage
WHERE {
  
  VALUES ?country {wd:Q30 wd:Q145}
  ?person wdt:P31 wd:Q5 ;  # Instance of human
      wdt:P569 ?birthdate ;  # Date of birth
      wdt:P27 ?country ;  # Citizenship
      wdt:P8687 ?followers; # Social Media Followers 
      wdt:P18 ?uniqueImage .  # Image

  FILTER(CONTAINS(STR(?birthdate), "-05-20"))
 
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
ORDER BY DESC(?followers)
LIMIT 50

Notice that Louis Theroux is British and American, hence why he appears twice. You can do a GROUP_CONCAT to fix the issue.

You could also consider pulling some of the data and run it in a local database that you host, whether a SPARQL database or some other type. That way you'd be able to add any metric or combination thereof to the data.

Upvotes: 2

PCDSandwichMan

Reputation: 2120

I think it's the site count that's killing your query and I think I resolved the duplicate record issue. Would something like this work for your use-case?

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT 
?person 
?personLabel 
?birthdate 
?countryLabel 
?website
(GROUP_CONCAT(DISTINCT ?occupationLabel; separator=", ") AS ?occupations) 
(MIN(?image) AS ?uniqueImage)
WHERE {
  ?person wdt:P31 wd:Q5 ;  # Instance of human
          wdt:P569 ?birthdate ;  # Date of birth
          wdt:P27 ?country .  # Citizenship
          
  OPTIONAL { ?person wdt:P856 ?website }  # Official website
  
  ?country rdfs:label ?countryLabel .
  FILTER(?country IN (wd:Q30, wd:Q145))
  FILTER(LANG(?countryLabel) = "en")
  
  FILTER(DATATYPE(?birthdate) = xsd:dateTime 
        && MONTH(?birthdate) = 5 
        && DAY(?birthdate) = 20)
   
  OPTIONAL { 
    ?person wdt:P106 ?occupation .  # Occupation
    ?occupation rdfs:label ?occupationLabel .
    FILTER(LANG(?occupationLabel) = "en")
  }
  
  OPTIONAL { ?person wdt:P18 ?image }  # Image

  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
GROUP BY ?person ?personLabel ?birthdate ?countryLabel ?website
ORDER BY DESC(?website)
LIMIT 50

Edit

If you do need the site count though you might be able to use something like this

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT 
?person 
?personLabel 
?birthdate 
?countryLabel 
(COUNT(DISTINCT ?sitelink) AS ?sites) 
(GROUP_CONCAT(DISTINCT ?occupationLabel; separator=", ") AS ?occupations) 
(MIN(?image) AS ?uniqueImage)
WHERE {
  ?person wdt:P31 wd:Q5 ;  # Instance of human
          wdt:P569 ?birthdate ;  # Date of birth
          wdt:P27 ?country .  # Citizenship
          
  OPTIONAL { ?person wdt:P18 ?image }  # Image

  ?sitelink schema:about ?person .
  
  ?country rdfs:label ?countryLabel .
  FILTER(?country IN (wd:Q30, wd:Q145))
  FILTER(LANG(?countryLabel) = "en")
  
  FILTER(DATATYPE(?birthdate) = xsd:dateTime 
        && MONTH(?birthdate) = 5 
        && DAY(?birthdate) = 20)
   
  OPTIONAL { 
    ?person wdt:P106 ?occupation .  # Occupation
    ?occupation rdfs:label ?occupationLabel .
    FILTER(LANG(?occupationLabel) = "en")
  }
  
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
GROUP BY ?person ?personLabel ?birthdate ?countryLabel
ORDER BY DESC(?sites)
LIMIT 50

Upvotes: 3

logi-kal

Reputation: 7880

A couple of little improvements:

You don't need do use rdfs:label if you already use SERVICE wikibase:label {...}, except if you want to force the English label to be specified in the item or if you want to use the variables in the SELECT part.
Instead of filtering a posteriori the values for ?country with FILTER(...), you can use VALUES ?country { wd:Q30 wd:Q145 }.

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?person ?personLabel ?birthdate ?countryLabel (COUNT(DISTINCT(?sitelink)) as ?sites) (GROUP_CONCAT(DISTINCT ?occupationLabel; separator=", ") as ?occupations) (SAMPLE(?image) as ?uniqueImage)
WHERE {
  ?person wdt:P31 wd:Q5 ;  # Instance of human
          wdt:P569 ?birthdate ;  # Date of birth
          wdt:P27 ?country ;  # Citizenship
          wdt:P106 ?occupation ;  # Occupation
          wdt:P18 ?uniqueImage .  # Image
  
  ?sitelink schema:about ?person .
  ?occupation rdfs:label ?occupationLabel .
  FILTER(LANG(?occupationLabel) = "en")
  FILTER(MONTH(?birthdate) = 5 && DAY(?birthdate) = 20)
  VALUES ?country { wd:Q30 wd:Q145 }
  
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,[AUTO_LANGUAGE]". }
}
GROUP BY ?person ?personLabel ?birthdate ?countryLabel ?uniqueImage ORDER BY DESC(?sites)
LIMIT 50

I don't think you can do much better, by experience I can tell you that schema:about property is very resource-demanding.

Upvotes: 2

SPARQL query running too slow, when querying, sometimes timesout, is there a way to optimize the query?

Answers (3)

Edit

Related Questions