Median Hilal
Median Hilal

Reputation: 1531

Optimizing aggregation query against wikidata

I am running an aggregation query against wiki data. The query tries to calculate the average duration of films, grouped by their genre and the year of publication

The multiple grouping/subqueries in the query are intended to retain an n-1 relationship from film to the grouping criteria (year and genre) and a 1-1 relationship between a film and its duration. Reason for this is having approximately correct aggregations (n-1 relationships are familiar for OLAP and data warehousing practitioners).

More explanation is embedded in the query. Hence I cannot drop down the groupings done in the subqueries and the if statements or the group concatenation. This query times out on Wikidata SPARQL endpoint.

QUESTION

I need some suggestion for performance enhancement... Any optimization hints? In case that's not possible, anyone aware of some authenticated way (so that they know I am not playing around) to query Wikidata so that timeout can be increased, or a way to increase timeout generally?

    # Average duration of films, grouped by their genre and the year of publication       
SELECT  
        ?genre1                    # film genre
        ?year1                     # film year of publication
        (AVG(?duration1) AS ?avg)   # film average duration

WHERE
        {      
            # Calculating the average duration for each single film.
            # As there are films with multiple duration, these durations are 
            # averagred by grouping aggregating durations by film.
            # Hence, a single duration for each film is projected out from the subquery.
            {
              select ?film (avg(?duration) as ?duration1)  
              where{
                ?film   <http://www.wikidata.org/prop/direct/P2047>   ?duration .    
              }group by ?film
            }

            # Here the grouping criteria (genre and year) are calculated.
            # The criteria is grouped by film, so that in case multiple 
            # genre/multiple year exist for a single film, all of them are
            # group concated into a single value.
            # Also in case of a lack of a value of year or genre for some
            # specific film, a dummy value "OtherYear"/"OtherGenre" is generated.
            {
              select ?film (
                                IF
                                (
                                    group_concat(distinct ?year ; separator="-- ") != "", 
                                    # In case multiple year exist for a single film, all of them are group concated into a single value.
                                    group_concat(distinct ?year ; separator="-- "), 
                                   # In case of a lack of a value of year for some specific film, a dummy value "OtherYear" is generated.
                                    "OtherYear"                                        
                                )
                                as ?year1
                              )
                                (
                                IF
                                (
                                    group_concat(distinct ?genre ; separator="-- ") != "",
                                    # In case multiple genre exist for a single film, all of them are group concated into a single value.
                                    group_concat(distinct ?genre ; separator="-- "), 
                                    # In case of a lack of a value of genre for some specific film, a dummy value "OtherGenre" is generated.
                                    "OtherGenre"  
                                )
                                as ?genre1
                              ) 

              where 
              {
                ?film  <http://www.wikidata.org/prop/direct/P31>  <http://www.wikidata.org/entity/Q11424> .
                 optional {
                   ?film   <http://www.wikidata.org/prop/direct/P577>  ?date .
                   BIND(year(?date) AS ?year)
                 }
                 optional {
                   ?film <http://www.wikidata.org/prop/direct/P136>  ?genre .
                 }
              } group by ?film              
          }

        } GROUP BY ?year1 ?genre1

Upvotes: 1

Views: 96

Answers (1)

cygri
cygri

Reputation: 9472

The query seems to work after replacing the two IF expressions with a simple sample (which picks an arbitrary value from the group):

    (sample(?year) as ?year1)
    (sample(?genre) as ?genre1) 

So it appears that the expense of group_concat is the main problem. I don't find that very intuitive and have no explanation.

Maybe the version with sample is good enough, or at least it may give you a baseline point for further improvements.

Upvotes: 1

Related Questions