John

Reputation: 4006

Correct way to get the last value for a field in Apache Spark or Databricks using SQL (correct behavior of last and last_value)?

What is the correct behavior of the last and last_value functions in Apache Spark/Databricks SQL? The way I'm reading the documentation (here: https://docs.databricks.com/spark/2.x/spark-sql/language-manual/functions.html), it sounds like it should return the last value of whatever is in the expression.

So if I have a select statement that does something like

select 
  person,
  last(team)
from
  (select * from person_team order by date_joined)
group by person

I should get the last team a person joined, yes/no?

The actual query I'm running is shown below. It is returning a different number each time I execute the query.

select count(distinct patient_id) from (
  select
    patient_id,
    org_patient_id,
    last_value(data_lot) data_lot
  from
    (select * from my_table order by data_lot)
  where 1=1
    and org = 'my_org'
  group by 1,2
  order by 1,2
)
where data_lot in ('2021-01','2021-02')
;

What is the correct way to get the last value for a given field (for either the team example or my specific example)?

--- EDIT -------------------

I'm thinking collect_set might be useful here, but I get the error shown below when I try to run this:

select
  patient_id,
  last_value(collect_set(data_lot)) data_lot
from
  covid.demo
group by patient_id
;


Error in SQL statement: AnalysisException: It is not allowed to use an aggregate function in the argument of another aggregate function. Please use the inner aggregate function in a sub-query.;;
Aggregate [patient_id#89338], [patient_id#89338, last_value(collect_set(data_lot#89342, 0, 0), false) AS data_lot#91848]
+- SubqueryAlias spark_catalog.covid.demo
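As the error suggests, the two aggregates have to be separated. One possible workaround (a sketch, not from the original post; it assumes data_lot values such as '2021-01' sort chronologically as plain strings) is to collect the values into an array and take its last element:

select
  patient_id,
  element_at(sort_array(collect_list(data_lot)), -1) as data_lot
from
  covid.demo
group by patient_id
;

Here sort_array and element_at are ordinary (non-aggregate) functions applied to the result of collect_list, so no aggregate is nested inside another.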

The posts below discuss how to get max values, which is not the same as the last value in a list ordered by a different field. I want the last team a player joined: if a player joined the Reds, the A's, the Zebras, and the Yankees, in that order in time, I'm looking for the Yankees. Those posts also reach the solution procedurally in Python/R; I'd like to do this in SQL.

Getting last value of group in Spark

Find maximum row per group in Spark DataFrame

--- SECOND EDIT -------------------

I ended up using something like this based upon the accepted answer.

  select 
    row_number() over (order by provided_date, data_lot) as row_num,
    demo.*
  from demo
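
For completeness, a sketch (not from the original post; the partition, descending order, and final filter are assumptions following the accepted answer) of how that row_num can be used to keep just the latest row per patient:

select patient_id, data_lot
from (
  select
    row_number() over (
      partition by patient_id
      order by provided_date desc, data_lot desc
    ) as row_num,
    demo.*
  from demo
)
where row_num = 1
;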

Upvotes: 0

Views: 1301

Answers (2)

Sasan Ahmadi

Reputation: 697

The last() function is non-deterministic according to the documentation, meaning it returns whatever it happens to see, which can be different every time you run the function.

The behaviour you are expecting is the last value based on a logical order.

Unfortunately, that order does not come from the virtual table. Technically a set cannot have an order, and it is pretty confusing that Spark even allows an order by in a virtual table or CTE. You are querying a set, and you should always treat it as unordered even if you introduce an order by clause.

To enforce deterministic behaviour, you should combine last() with a window that defines the order you are expecting; based on that, it deterministically returns the last value in the ordered window.

So a form that guarantees a correct result is something like this:

select distinct
  person,
  last(team) over (
    partition by person
    order by date_joined
    rows between unbounded preceding and unbounded following
  ) as last_team
from person_team

Here the window orders each person's rows by date_joined, and the explicit frame (rows between unbounded preceding and unbounded following) makes last() consider the whole partition rather than stopping at the current row, so it always knows what to pick. distinct then collapses the per-row duplicates into one row per person.
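
A quick self-contained check with made-up data (not from the original answer); for this player the query should return 'Yankees':

with person_team as (
  select * from values
    ('alice', 'Reds',    date '2020-01-01'),
    ('alice', 'As',      date '2020-03-15'),
    ('alice', 'Zebras',  date '2020-06-30'),
    ('alice', 'Yankees', date '2020-09-01')
    as t(person, team, date_joined)
)
select distinct
  person,
  last(team) over (
    partition by person
    order by date_joined
    rows between unbounded preceding and unbounded following
  ) as last_team
from person_team
;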

Upvotes: 0

mck

Reputation: 42352

You can assign row numbers based on an ordering on data_lot if you want to get its last value:

select count(distinct patient_id) from (
    select * from (
        select *,
            row_number() over (partition by patient_id, org_patient_id, org order by data_lot desc) as rn
        from my_table
        where org = 'my_org'
    )
    where rn = 1
) 
where data_lot in ('2021-01','2021-02');
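
Note that row_number is deterministic only if the order by uniquely identifies each row within a partition; if several rows can share the same data_lot, consider adding a tie-breaker column to the order by.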

Upvotes: 1
