Reputation: 4006
What is the correct behavior of the last and last_value functions in Apache Spark/Databricks SQL? The way I'm reading the documentation (here: https://docs.databricks.com/spark/2.x/spark-sql/language-manual/functions.html), it sounds like they should return the last value of whatever is in the expression.
So if I have a select statement that does something like
select
  person,
  last(team)
from
  (select * from person_team order by date_joined)
group by person
I should get the last team a person joined, yes/no?
The actual query I'm running is shown below. It is returning a different number each time I execute the query.
select count(distinct patient_id) from (
  select
    patient_id,
    org_patient_id,
    last_value(data_lot) data_lot
  from
    (select * from my_table order by data_lot)
  where 1=1
    and org = 'my_org'
  group by 1,2
  order by 1,2
)
where data_lot in ('2021-01','2021-02')
;
What is the correct way to get the last value for a given field (for either the team example or my specific example)?
--- EDIT -------------------
I'm thinking collect_set might be useful here, but I get the error shown when I try to run this:
select
  patient_id,
  last_value(collect_set(data_lot)) data_lot
from
  covid.demo
group by patient_id
;
Error in SQL statement: AnalysisException: It is not allowed to use an aggregate function in the argument of another aggregate function. Please use the inner aggregate function in a sub-query.;;
Aggregate [patient_id#89338], [patient_id#89338, last_value(collect_set(data_lot#89342, 0, 0), false) AS data_lot#91848]
+- SubqueryAlias spark_catalog.covid.demo
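It looks like collecting first and then taking the last element with scalar functions might get around that restriction, since sort_array and element_at aren't aggregates themselves. A sketch, assuming the data_lot strings sort chronologically:
select
  patient_id,
  element_at(sort_array(collect_list(data_lot)), -1) as data_lot
from
  covid.demo
group by patient_id
;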
The posts shown below discuss how to get max values, which is not the same as the last value in a list ordered by a different field. I want the last team a player joined: the player may have joined the Reds, the A's, the Zebras, and the Yankees, in that order timewise, and I'm looking for the Yankees. Those posts also get to the solution procedurally using Python/R; I'd like to do this in SQL.
Getting last value of group in Spark
Find maximum row per group in Spark DataFrame
--- SECOND EDIT -------------------
I ended up using something like this based upon the accepted answer.
select
  row_number() over (order by provided_date, data_lot) as row_num,
  demo.*
from demo
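One way those row numbers might then be used to pick the latest value per patient (a sketch; the max_by step is my own assumption and requires Spark 3.0+):
with numbered as (
  select
    row_number() over (order by provided_date, data_lot) as row_num,
    demo.*
  from demo
)
select
  patient_id,
  max_by(data_lot, row_num) as data_lot
from numbered
group by patient_id
;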
Upvotes: 0
Views: 1301
Reputation: 697
The last()
function is non-deterministic according to the documentation, meaning it returns whatever it happens to see, and that could be anything each time you run the function.
The behaviour you are expecting is the last value based on a logical order!
Unfortunately, that order doesn't come from the virtual table. Technically a set cannot have an order, and it is pretty confusing that Spark allows an order by in a virtual table or CTE. You are querying a set, and you should always think of it as unordered even when you introduce an order by
clause!
To enforce deterministic behaviour you should combine last()
with a window that defines the order you expect; based on that, it deterministically returns the last value in the ordered window.
So the form that guarantees a correct result is something like this:
select distinct
  person,
  last(team) over (
    partition by person
    order by date_joined
    rows between unbounded preceding and unbounded following
  ) as team
from person_team
Here we define a window that, for each person, orders the rows by date_joined
with an explicit frame spanning the whole partition (the default frame would stop at the current row, so last()
would just return the current row's value). With that frame, last() always knows what to pick, and select distinct collapses the per-row results to one row per person.
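Applied to the patient query from the question, the same pattern would look roughly like this (a sketch; I'm assuming data_lot itself carries the chronological order you want):
select count(distinct patient_id) from (
  select distinct
    patient_id,
    org_patient_id,
    last_value(data_lot) over (
      partition by patient_id, org_patient_id
      order by data_lot
      rows between unbounded preceding and unbounded following
    ) as data_lot
  from my_table
  where org = 'my_org'
)
where data_lot in ('2021-01','2021-02');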
Upvotes: 0
Reputation: 42352
You can assign row numbers based on an ordering on data_lot
if you want to get its last value:
select count(distinct patient_id) from (
  select * from (
    select *,
      row_number() over (partition by patient_id, org_patient_id, org order by data_lot desc) as rn
    from my_table
    where org = 'my_org'
  )
  where rn = 1
)
where data_lot in ('2021-01','2021-02');
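Note that rn = 1 with a descending order picks the row with the greatest data_lot per group, which is equivalent to taking the last value in ascending order; the row-number approach also sidesteps the default window frame of last_value, which stops at the current row.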
Upvotes: 1