Reputation: 1531
I am running an aggregation query against wiki data. The query tries to calculate the average duration of films, grouped by their genre and the year of publication
The multiple grouping/subqueries in the query are intended to retain an n-1
relationship from film to the grouping criteria (year and genre) and a 1-1
relationship between a film and its duration. Reason for this is having approximately correct aggregations (n-1 relationships are familiar for OLAP and data warehousing practitioners).
More explanation is embedded in the query. Hence I cannot drop down the groupings done in the subqueries and the if statements or the group concatenation. This query times out on Wikidata SPARQL endpoint.
QUESTION
I need some suggestion for performance enhancement... Any optimization hints? In case that's not possible, anyone aware of some authenticated way (so that they know I am not playing around) to query Wikidata so that timeout can be increased, or a way to increase timeout generally?
# Average duration of films, grouped by their genre and the year of publication
SELECT
?genre1 # film genre
?year1 # film year of publication
(AVG(?duration1) AS ?avg) # film average duration
WHERE
{
# Calculating the average duration for each single film.
# As there are films with multiple duration, these durations are
# averagred by grouping aggregating durations by film.
# Hence, a single duration for each film is projected out from the subquery.
{
select ?film (avg(?duration) as ?duration1)
where{
?film <http://www.wikidata.org/prop/direct/P2047> ?duration .
}group by ?film
}
# Here the grouping criteria (genre and year) are calculated.
# The criteria is grouped by film, so that in case multiple
# genre/multiple year exist for a single film, all of them are
# group concated into a single value.
# Also in case of a lack of a value of year or genre for some
# specific film, a dummy value "OtherYear"/"OtherGenre" is generated.
{
select ?film (
IF
(
group_concat(distinct ?year ; separator="-- ") != "",
# In case multiple year exist for a single film, all of them are group concated into a single value.
group_concat(distinct ?year ; separator="-- "),
# In case of a lack of a value of year for some specific film, a dummy value "OtherYear" is generated.
"OtherYear"
)
as ?year1
)
(
IF
(
group_concat(distinct ?genre ; separator="-- ") != "",
# In case multiple genre exist for a single film, all of them are group concated into a single value.
group_concat(distinct ?genre ; separator="-- "),
# In case of a lack of a value of genre for some specific film, a dummy value "OtherGenre" is generated.
"OtherGenre"
)
as ?genre1
)
where
{
?film <http://www.wikidata.org/prop/direct/P31> <http://www.wikidata.org/entity/Q11424> .
optional {
?film <http://www.wikidata.org/prop/direct/P577> ?date .
BIND(year(?date) AS ?year)
}
optional {
?film <http://www.wikidata.org/prop/direct/P136> ?genre .
}
} group by ?film
}
} GROUP BY ?year1 ?genre1
Upvotes: 1
Views: 96
Reputation: 9472
The query seems to work after replacing the two IF
expressions with a simple sample
(which picks an arbitrary value from the group):
(sample(?year) as ?year1)
(sample(?genre) as ?genre1)
So it appears that the expense of group_concat
is the main problem. I don't find that very intuitive and have no explanation.
Maybe the version with sample
is good enough, or at least it may give you a baseline point for further improvements.
Upvotes: 1