Esteban Zimanyi
Esteban Zimanyi

Reputation: 201

Aggregation + Rank in SPARQL -> Cumulative Percent?

A very elegant way to compute ranking into SPARQL has beeen suggested by Joshua Taylor in the answer to the following question How to rank values in SPARQL?

However, I need the Aggregation + Ranking as a first step to obtain the cumulative percent in queries such as "Countries that account for top 50% of the sales amount."

Suppose that we have sales at state level

California 30
Arizona 25
...
Alberta 25
Quebec 20
...

I would need to first aggregate by country and obtain the rank of the country as follows

United States 250 1
Canada 200 2
Mexico 150 3
...

How to write Aggregation + Rank in SPARQL generalizing Joshua's answer mentioned above ?

For the next step I would need to do the cumulative sum using the rank as follows

United States 250 1 250
Canada 200 2 450
Mexico 150 3 600
...

For writing this in SPARQL I guess I can use a SQL trick as reported here how to get cumulative sum

If the above succeeded, then we can try the cumulative percentage. Suppose now that the total sales for all countries is 1000. Then the result would be

United States 250  250
Canada 200 450

For writing this in SPARQL we can use something that in SQL would be like follows

SELECT CountryName, SalesAmount
FROM CumSalesCountry
WHERE CumSalesAmount <=
(SELECT MIN(CumSalesAmount) FROM CumSalesCountry
WHERE CumSalesAmount >=
(SELECT 0.5 * SUM(SalesAmount) FROM SalesCountry) ) )

Any help on this would be much appreciated

Esteban (struggling with writing BI-like queries in SPARQL ...)

Upvotes: 1

Views: 761

Answers (2)

Joshua Taylor
Joshua Taylor

Reputation: 85823

Suppose you've got data like this:

@prefix : <urn:ex:> .

:a1 a :A ; :v 05 .
:a2 a :A ; :v 10 .

:b1 a :B ; :v 10 .
:b2 a :B ; :v 10 .
:b3 a :B ; :v 05 .

:c1 a :C ; :v 10 .

Then you can use a query like this:

prefix : <urn:ex:>

select
   ?type
   (?value*100/?total as ?percent)
   (count(?type2) as ?rank)
   (sum(?value2)*100/?total as ?cumulativePercent)
where { 
  #-- total value across all types 
  { select (sum(?value) as ?total) 
    where { ?x :v ?value } }

  #-- each type and its sum value as ?type and ?value
  { select ?type (sum(?v) as ?value)
    where { ?x a ?type ; :v ?v }
    group by ?type }

  #-- each type and its sum value as ?type2 and ?value2
  { select ?type2 (sum(?v) as ?value2)
    where { ?x a ?type2 ; :v ?v }
    group by ?type2 }

  filter ( ?value2 >= ?value )
}
group by ?type ?value ?total
order by desc(?percent)

to get results like these:

---------------------------------------------
| type | percent | rank | cumulativePercent |
=============================================
| :B   | 50.0    | 1    | 50.0              |
| :A   | 30.0    | 2    | 80.0              |
| :C   | 20.0    | 3    | 100.0             |
---------------------------------------------

Now, there is a problem here, if tow of the types have the same percentages. E.g., if you add the data

:d1 a :D ; :v 07 .
:d2 a :D ; :v 08 .

then you get the results:

---------------------------------------------------------------------------
| type | percent                     | rank | cumulativePercent           |
===========================================================================
| :B   | 38.461538461538461538461538 | 1    | 38.461538461538461538461538 |
| :A   | 23.076923076923076923076923 | 3    | 84.615384615384615384615384 |
| :D   | 23.076923076923076923076923 | 3    | 84.615384615384615384615384 |
| :C   | 15.384615384615384615384615 | 4    | 99.999999999999999999999999 |
---------------------------------------------------------------------------

That shows that A and D are tied, and gives them the same cumulative percentage and rank. If that's not what you want, you can add something to the filter condition, e.g.,

filter (  ?value2 > ?value
       || ( ?value2 = ?value &&
            str(?type2) >= str(?type) ))

to resolve the identical cases and get results like:

---------------------------------------------------------------------------
| type | percent                     | rank | cumulativePercent           |
===========================================================================
| :B   | 38.461538461538461538461538 | 1    | 38.461538461538461538461538 |
| :D   | 23.076923076923076923076923 | 2    | 61.538461538461538461538461 |
| :A   | 23.076923076923076923076923 | 3    | 84.615384615384615384615384 |
| :C   | 15.384615384615384615384615 | 4    | 99.999999999999999999999999 |
---------------------------------------------------------------------------

Upvotes: 3

AndyS
AndyS

Reputation: 16630

SPARQL support subqueries which can be used for calculation of aggregates as part of an overall query:

SELECT * {
  { SELECT (MIN(?CumSalesAmount) AS ?minSalesAmount) 
    {  ... get ?CumSalesAmount ... } }
  { SELECT (0.5*sum(?SalesAmount) AS ?aggSalesAmount) 
     {  ... get ?SalesAmount ... } }
  ... get ?CountryName ?SalesAmount ...
}

The effects are not identical - it's a way of approaching the problem, not a translation clone of your example.

Upvotes: 1

Related Questions