Bryant
Bryant

Reputation: 85

(SQL) How to compare each row to all other rows in Presto

I'm working with AWS Athena which uses Presto. Let's say I have a SQL table with columns A, B, C, and D. Assume table is sorted by column C, ascending.

I need to compare each row to all the other rows and check if current row's D value is the maximum value out of all rows whose C values are less than current row's C value. Then append a boolean value in column F. Code in Python would look something like:

D_val_list = []
for index, row in df.iterrows():
    max_val_D = df[:index]['D'].max() #Sorted on column C
    if row['D'] < max_val_D:
        D_val_list.append(FALSE)
    else:
        D_val_list.append(TRUE)
df['F'] = D_val_list

Using the provisional jupyter notebook in Athena times out (the dataset is millions of rows long) and I figure connecting to AWS via local jupyter instance would have similar issues.

Upvotes: 0

Views: 1534

Answers (2)

saifuddin778
saifuddin778

Reputation: 7298

You have to discretely order your rows on c in Athena because of its distributed nature. You can use window functions on top of the ordered set to achieve your desired results:

SELECT 
     a,
     b,
     c,
     d,   
     CASE WHEN d>lag(max_so_far) OVER () THEN true ELSE false END as f
FROM (
    SELECT a,
        b,
        c,
        d,
        max(d) OVER (rows BETWEEN unbounded preceding AND current row) AS max_so_far
    FROM ( 
      -- sorted ON c 
      SELECT 
        a,
        b,
        c,
        d
      FROM dataset.table
      ORDER BY  c 
    ) 
 )

Upvotes: 0

Gordon Linoff
Gordon Linoff

Reputation: 1271191

In SQL, you would use window functions -- something like this:

select t.*,
       (case when d < coalesce(max(d) over (order by c
                                            rows between unbounded preceding and 1 preceding) is null,
                               d + 1
             then 1 else 0
        end) as flag          
from t;

This logic would work assuming that c is unique. That said, there might be alternative depending on the exact nature of the data.

Upvotes: 1

Related Questions