Cast_A_Way
Cast_A_Way

Reputation: 472

how to write java udf in impala

I want to implement running total example in Impala like below :

DATE        |   DAY |   COUNT   |   Total
2014-11-12  |   1   |   12      |   12
2014-11-13  |   2   |   32      |   44
2014-11-14  |   3   |   50      |   94

To solve this problem using Impala, we might need temp variable like in MySql, which Impala does not support.

For this, I want to use java udf to add first row total with second row count. So that I will get second row total. How could I do this. Please suggest. Please also provide steps to add jar into Impala. Your help will be very helpful.

Upvotes: 2

Views: 1030

Answers (1)

Matt
Matt

Reputation: 4334

If you're using Impala 2.0 you can compute a running total using analytic (window) functions.

For example,

select date, day, count, sum(count) over (order by date) from your_table

See the Impala documentation for more information about analytic functions.

Update:

It is not possible to implement cumulative sum in the way that you want using a UDA, this is why 'analytic/window functions' exist. Why? A UDA is used to write custom functions for aggregations which group input rows by some expression and apply the aggregate fn to all the rows in the same group, but only a single output row is produced for the group. There are two problems with this:

  1. You want an output row for every input row where the cumulative sum is updated on every row. Consider the following queries with aggregations:

    sum(x) from mytable; 10

    age, count(*) from students group by age; 14, 1 15, 10 16, 11

    Notice how the aggregation 'collapses' a number of rows and evaluates the function (e.g. sum or count) over the entire set.

  2. You need the rows to be ordered on the date when computing the cumulative sum. As I mentioned, an aggregation applies the aggregate fn to a set of rows that have the same grouping expression, but they can be applied in any order. Analytic functions allow you to apply a function to a set of rows in a specified order, which is what you would need to compute the cumulative sum you want.

That said, in this case you can use a self join to produce the cumulative sum that you want. For example:

> select t1.id, sum(t2.id) as csum from tbl t1 inner join tbl t2 on t1.id >= t2.id group by t1.id order by t1.id;
+----+------------+
| id | csum       |
+----+------------+
| 0  | 0          |
| 1  | 1          |
| 2  | 3          |
| 3  | 6          |
| 4  | 10         |
| 5  | 15         |
| 6  | 21         |
| 7  | 28         |
+----+------------+

However, this could be expensive to compute and is harder to express in SQL. Ideally you can just upgrade to Impala 2.0 and use analytic functions.

Upvotes: 1

Related Questions