m.alcala
m.alcala

Reputation: 91

Comparing Current Row Value to All previous Row Value in HIVE SQL or TABLEAU

I wanted to compare the current row value to all previous value either in Tableau or SQL. Below is an example of what I wanted.

enter image description here

So like for colA Row2, I would like to compare it to Row1 if it is greater than. For colA Row 2, I would like to compare it with Row2 and Row3. This will go on and on until it reaches the end of the partition. Then the output is just counting how many scenarios were TRUE (as shown in the right side of the table).

Does anyone know any syntax on how to do this in Tableau or Hive SQL? I already explored the LAG() function in HIVE sql as well as WINDOW_SUM() function in Tableau.

I can opt to create a UDF in Hive but I am not that familiar with the concept nor with the syntaxes.

Thanks for the help.

Upvotes: 2

Views: 1540

Answers (3)

Alex Blakemore
Alex Blakemore

Reputation: 11919

When the order of data values affects a calculation, you have two choices as a Tableau client.

  1. Use custom SQL to write a query that depends on row order, such as with later SQL versions that support windowing queries, using the over and partition keywords. This approach performs the order-specific calculations on the database server.
  2. Use Tableau table calcs to write a calculation that traverses the (aggregate) query results to compute the desired value. Table calcs are the only one of the (at least 4) types of calculations in Tableau that can compare values between different rows. The other types of Tableau queries (record level calcs, aggregate calcs and LOD calcs) are very useful but cannot do anything that depends on the order of different rows.

Both approaches are useful. Both have quirks and complexities to get around. All things being equal, table calcs are more flexible so I'd try those first. Realize that in addition to the formula, a table calc is also specified by information about how to traverse the query result set - known as partitioning and addressing. See the online help for table calcs.

However, since table calcs are implemented client side, then they aren't the best choice when you have extremely large data sets that would otherwise not need to be fetched to the client. They are a good choice when the information driving your calculations has already been fetched client-side, say aggregate query results.

Upvotes: 0

Georgina Skibinski
Georgina Skibinski

Reputation: 13407

Try this one:

where gs_test is your input table with cola column

select 
    cola, 
    row_num, 
    sum(if(cola>next_el_, 1, 0)) as countOfTrueBefore 
from 
    (
    select 
        cola, 
        row_num, 
        collect_set(cola) over (order by row_num asc rows between 1 following and unbounded following) as next_el 
    from 
        (
        select 
            cola, 
            sum(1) over (rows between unbounded preceding and current row) as row_num 
        from 
            gs_test
        ) main_inner
    ) main 
lateral view outer explode(next_el) ne as next_el_ group by cola, row_num;

Result:

cola - original column

row_num - inverse row number (the first row is the last)

countOfTrueBefore - as per your logic

MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 6.91 sec   HDFS Read: 7690 HDFS Write: 278 SUCCESS
Stage-Stage-2: Map: 1  Reduce: 1   Cumulative CPU: 7.2 sec   HDFS Read: 8314 HDFS Write: 285 SUCCESS
Stage-Stage-3: Map: 1  Reduce: 1   Cumulative CPU: 5.34 sec   HDFS Read: 5586 HDFS Write: 63 SUCCESS
Total MapReduce CPU Time Spent: 19 seconds 450 msec
OK
0.01    5       0
0.33    1       1
0.47    6       0
0.48    2       2
0.75    4       2
0.85    7       0
0.88    3       4

Upvotes: 1

Gordon Linoff
Gordon Linoff

Reputation: 1271141

In order for your question to make sense, you need a column that specifies the ordering. I will assume the first columns is an id column with this infromation.

This is challenging in SQL in general. One method that should work in Hive is a cross join and aggregation.

select t.id, t.colA,
       sum(case when t2.id < t.id and t2.colA < t.colA then 1 else 0 end)
from t cross join
     t t2
group by t.id, t.colA;

Note: this will not have good performance on even a medium sized table.

Upvotes: 0

Related Questions