Reputation: 85
I'm working with AWS Athena which uses Presto. Let's say I have a SQL table with columns A, B, C, and D. Assume table is sorted by column C, ascending.
I need to compare each row to all the other rows and check if current row's D value is the maximum value out of all rows whose C values are less than current row's C value. Then append a boolean value in column F. Code in Python would look something like:
D_val_list = []
for index, row in df.iterrows():
max_val_D = df[:index]['D'].max() #Sorted on column C
if row['D'] < max_val_D:
D_val_list.append(FALSE)
else:
D_val_list.append(TRUE)
df['F'] = D_val_list
Using the provisional jupyter notebook in Athena times out (the dataset is millions of rows long) and I figure connecting to AWS via local jupyter instance would have similar issues.
Upvotes: 0
Views: 1534
Reputation: 7298
You have to discretely order your rows on c
in Athena because of its distributed nature. You can use window functions on top of the ordered set to achieve your desired results:
SELECT
a,
b,
c,
d,
CASE WHEN d>lag(max_so_far) OVER () THEN true ELSE false END as f
FROM (
SELECT a,
b,
c,
d,
max(d) OVER (rows BETWEEN unbounded preceding AND current row) AS max_so_far
FROM (
-- sorted ON c
SELECT
a,
b,
c,
d
FROM dataset.table
ORDER BY c
)
)
Upvotes: 0
Reputation: 1271191
In SQL, you would use window functions -- something like this:
select t.*,
(case when d < coalesce(max(d) over (order by c
rows between unbounded preceding and 1 preceding) is null,
d + 1
then 1 else 0
end) as flag
from t;
This logic would work assuming that c
is unique. That said, there might be alternative depending on the exact nature of the data.
Upvotes: 1