Reputation: 57

Calculating percentile in data bricks

Can anyone please help and tell where is the error? What Am I doing wrong? (Databricks)

even the example from databricks www doesn't work and produce the same error like below. Is there any other method to calculate this metric?

select

        customerid,
        yearid,
        monthid,
        sum(TotalSpendings) as TotalSpendings,
        sum(TotalQuantity) as TotalQuantity,
        count (distinct ticketid) as TotalTickets,
        AVG(AvgIndexesPerTicket) as AvgIndexesPerTicket,
        max (transactiondate) as DateOfLastVisit,
        count(distinct transactiondate) as TotalNumberOfVisits,
        AVG(TotalSpendings) as AverageTicket,
        sum(TotalQuantity)/count(distinct ticketid) as AvgQttyPerTicket,
        sum(TotalDiscount) as TotalDiscount,
        percentile_disc(0.25) WITHIN GROUP (ORDER BY TotalQuantity), 
        percentile_disc(0.50) WITHIN GROUP (ORDER BY TotalQuantity),
        percentile_disc(0.75) WITHIN GROUP (ORDER BY TotalQuantity) as PercentileQttyTicket_75,
        percentile_disc(0.90) WITHIN GROUP (ORDER BY TotalQuantity) as PercentileQttyTicket_90,
        
        percentile_disc(0.25) WITHIN GROUP (ORDER BY TotalSpendings) as PercentileSpendingsTicket_25,
        percentile_disc(0.50) WITHIN GROUP (ORDER BY TotalSpendings) as PercentileSpendingsTicket_50,
        percentile_disc(0.75) WITHIN GROUP (ORDER BY TotalSpendings) as PercentileSpendingsTicket_75,
        percentile_disc(0.90) WITHIN GROUP (ORDER BY TotalSpendings) as PercentileSpendingsTicket_90
        
from (

select
        a.customerid,
        a.ticketid,
        a.transactiondate,
        extract(year from a.transactiondate) as yearid,
        extract(month from a.transactiondate) as monthid,
        sum(positionvalue) as TotalSpendings,
        sum(quantity) as TotalQuantity,   
        count(distinct productindex)/count(distinct a.ticketid) as AvgIndexesPerTicket,
        sum(discountvalue) as TotalDiscount
        from default.TICKET_ITEM a
          

        where 1=1

        and a.transactiondate between '2022-10-01' and '2022-10-31'
        and a.transactiontype = 'S'
        and a.transactiontypeheader = 'S'
        and a.customerid in ('94861b2c83c54d03930af4585a3a325a')
        and length(a.customerid) > 10
        group by 1,2,3,4,5) DETAL
        
        group by 1,2,3"""

I still receive error:

ParseException: no viable alternative at input 'GROUP ('(line 15, pos 43)

Upvotes: 0

Answers (2)

kshah

Reputation: 1

Why are you selecting customerid in your select statement? I believe you want to look at for each months, what is total spending, what is total transaction, 25th percentile of spending etc? For that you should not group it by customerid and instead try below for percentile percentile(TotalQuantity, 0.25) as PercentileQty_25

Upvotes: 0

CRAFTY DBA

Reputation: 14915

Try reducing the complexity of the problem until you figure out what is wrong. Unless I have you TICKET_ITEM hive table, I can not try debugging the issue in my environment. Many times I break a complex query into pieces.

First, always put data into a schema (database) for management.

%sql
create database STACK_OVER_FLOW

Thus, your table would be recreated as STACK_OVER_FLOW.TICKET_ITEM.

Second, place the inner query into a permanent or temporary view. The code below creates a permanent view in the new schema.

%sql
create view STACK_OVER_FLOW.FILTERED_TICKET_ITEM as
select
    a.customerid,
    a.ticketid,
    a.transactiondate,
    extract(year from a.transactiondate) as yearid,
    extract(month from a.transactiondate) as monthid,
    sum(a.positionvalue) as TotalSpendings,
    sum(a.quantity) as TotalQuantity,   
    count(distinct a.productindex) / count(distinct a.ticketid) as AvgIndexesPerTicket,
    sum(discountvalue) as TotalDiscount
from 
    STACK_OVER_FLOW.TICKET_ITEM a
where 
    1=1
    and a.transactiondate between '2022-10-01' and '2022-10-31'
    and a.transactiontype = 'S'
    and a.transactiontypeheader = 'S'
    and a.customerid in ('94861b2c83c54d03930af4585a3a325a')
    and length(a.customerid) > 10
group by 
    customerid,
    ticketid,
    transactiondate,
    yearid,
    monthid

Third, always group by or order by name, not by position. You might the field order over time. I did notice extra """ at the end of the query but it might be a typo.

At this point you will know if the inner query works correctly in the view and you can focus on the outer query with the percentiles.

In data engineering, I have seen the spark optimizer get confused when the number of temporary views is large. In these cases, the intermediate view might have to be written to file as a step. Then you can expose that file as a view and continue with your engineering effort.

The percentile_disc is part of the databricks distribution.

https://docs.databricks.com/sql/language-manual/functions/percentile_disc.html

It is not a core function that is part of the open source distribution.

https://spark.apache.org/docs/latest/api/sql/index.html#percentile

Please add more information to the post after you reduce the complexity and still can not find your issue.

Upvotes: 1

Calculating percentile in data bricks

Answers (2)

Related Questions