Reputation: 1816
My Wikidata Query Service query sometimes takes 35000 milliseconds or 35 seconds to complete. I'm not very adept at SPARQL. The query below does work (except it does duplicates sometimes). I'm wanting to get "famous people" by providing a birthday date and month, of which I get their name, birthday, an image of them (Wikimedia), and occupations. I'm also filtering by persons with birth places in the USA and UK.
I've added a variable called "sitelinks" that I count how many links are dedicated to them as a popularity metric (I'm open to better ideas on this, if there is a better way to gauge popularity). Is there a way to make this more optimized? Again, the query works, it's just slow.
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?person ?personLabel ?birthdate ?countryLabel (COUNT(DISTINCT(?sitelink)) as ?sites) (GROUP_CONCAT(DISTINCT ?occupationLabel; separator=", ") as ?occupations) (SAMPLE(?image) as ?uniqueImage)
WHERE {
?person wdt:P31 wd:Q5 ; # Instance of human
wdt:P569 ?birthdate ; # Date of birth
wdt:P27 ?country ; # Citizenship
wdt:P106 ?occupation ; # Occupation
wdt:P18 ?uniqueImage . # Image
?country rdfs:label ?countryLabel .
?occupation rdfs:label ?occupationLabel .
?sitelink schema:about ?person .
FILTER(LANG(?countryLabel) = "en")
FILTER(LANG(?occupationLabel) = "en")
FILTER(MONTH(?birthdate) = 5 && DAY(?birthdate) = 20)
FILTER(?country IN (wd:Q30, wd:Q145))
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
GROUP BY ?person ?personLabel ?birthdate ?countryLabel ?uniqueImage ORDER BY DESC(?sites)
LIMIT 50
If anyone wants to paste the query to the Wikidata Query Service, here is the link https://query.wikidata.org/
Upvotes: 1
Views: 343
Reputation: 1966
Interesting question!
It's true that it's unlikely that the query can be optimised much more than what logi-kal
did. However you could look at the number of social media followers that a user has wdt:P8687
as a proxy for their popularity, however imperfect.
This would yield a query like this, which takes me 7-8s to run:
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?person ?personLabel ?birthdate ?countryLabel ?followers ?uniqueImage
WHERE {
VALUES ?country {wd:Q30 wd:Q145}
?person wdt:P31 wd:Q5 ; # Instance of human
wdt:P569 ?birthdate ; # Date of birth
wdt:P27 ?country ; # Citizenship
wdt:P8687 ?followers; # Social Media Followers
wdt:P18 ?uniqueImage . # Image
FILTER(CONTAINS(STR(?birthdate), "-05-20"))
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
ORDER BY DESC(?followers)
LIMIT 50
Notice that Louis Theroux is British and American, hence why he appears twice. You can do a GROUP_CONCAT
to fix the issue.
You could also consider pulling some of the data and run it in a local database that you host, whether a SPARQL database or some other type. That way you'd be able to add any metric or combination thereof to the data.
Upvotes: 2
Reputation: 2120
I think it's the site count that's killing your query and I think I resolved the duplicate record issue. Would something like this work for your use-case?
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT
?person
?personLabel
?birthdate
?countryLabel
?website
(GROUP_CONCAT(DISTINCT ?occupationLabel; separator=", ") AS ?occupations)
(MIN(?image) AS ?uniqueImage)
WHERE {
?person wdt:P31 wd:Q5 ; # Instance of human
wdt:P569 ?birthdate ; # Date of birth
wdt:P27 ?country . # Citizenship
OPTIONAL { ?person wdt:P856 ?website } # Official website
?country rdfs:label ?countryLabel .
FILTER(?country IN (wd:Q30, wd:Q145))
FILTER(LANG(?countryLabel) = "en")
FILTER(DATATYPE(?birthdate) = xsd:dateTime
&& MONTH(?birthdate) = 5
&& DAY(?birthdate) = 20)
OPTIONAL {
?person wdt:P106 ?occupation . # Occupation
?occupation rdfs:label ?occupationLabel .
FILTER(LANG(?occupationLabel) = "en")
}
OPTIONAL { ?person wdt:P18 ?image } # Image
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
GROUP BY ?person ?personLabel ?birthdate ?countryLabel ?website
ORDER BY DESC(?website)
LIMIT 50
If you do need the site count though you might be able to use something like this
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT
?person
?personLabel
?birthdate
?countryLabel
(COUNT(DISTINCT ?sitelink) AS ?sites)
(GROUP_CONCAT(DISTINCT ?occupationLabel; separator=", ") AS ?occupations)
(MIN(?image) AS ?uniqueImage)
WHERE {
?person wdt:P31 wd:Q5 ; # Instance of human
wdt:P569 ?birthdate ; # Date of birth
wdt:P27 ?country . # Citizenship
OPTIONAL { ?person wdt:P18 ?image } # Image
?sitelink schema:about ?person .
?country rdfs:label ?countryLabel .
FILTER(?country IN (wd:Q30, wd:Q145))
FILTER(LANG(?countryLabel) = "en")
FILTER(DATATYPE(?birthdate) = xsd:dateTime
&& MONTH(?birthdate) = 5
&& DAY(?birthdate) = 20)
OPTIONAL {
?person wdt:P106 ?occupation . # Occupation
?occupation rdfs:label ?occupationLabel .
FILTER(LANG(?occupationLabel) = "en")
}
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
GROUP BY ?person ?personLabel ?birthdate ?countryLabel
ORDER BY DESC(?sites)
LIMIT 50
Upvotes: 3
Reputation: 7880
A couple of little improvements:
rdfs:label
if you already use SERVICE wikibase:label {...}
, except if you want to force the English label to be specified in the item or if you want to use the variables in the SELECT
part.?country
with FILTER(...)
, you can use VALUES ?country { wd:Q30 wd:Q145 }
.PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?person ?personLabel ?birthdate ?countryLabel (COUNT(DISTINCT(?sitelink)) as ?sites) (GROUP_CONCAT(DISTINCT ?occupationLabel; separator=", ") as ?occupations) (SAMPLE(?image) as ?uniqueImage)
WHERE {
?person wdt:P31 wd:Q5 ; # Instance of human
wdt:P569 ?birthdate ; # Date of birth
wdt:P27 ?country ; # Citizenship
wdt:P106 ?occupation ; # Occupation
wdt:P18 ?uniqueImage . # Image
?sitelink schema:about ?person .
?occupation rdfs:label ?occupationLabel .
FILTER(LANG(?occupationLabel) = "en")
FILTER(MONTH(?birthdate) = 5 && DAY(?birthdate) = 20)
VALUES ?country { wd:Q30 wd:Q145 }
SERVICE wikibase:label { bd:serviceParam wikibase:language "en,[AUTO_LANGUAGE]". }
}
GROUP BY ?person ?personLabel ?birthdate ?countryLabel ?uniqueImage ORDER BY DESC(?sites)
LIMIT 50
I don't think you can do much better, by experience I can tell you that schema:about
property is very resource-demanding.
Upvotes: 2