Coderzz
Coderzz

Reputation: 55

Aggregating data where conditions are stored in another table

I have two tables. I for each date and category pair in the table2 I want to calculate the past week and two-week record counts for that category calculated from that day.

table1

| DATE         |Category |
|--------------|---------|
|2018-10-01    |ABC1     |
|2018-10-03    |ABC1     |
|2018-10-05    |ABC2     |
|2018-10-07    |ABC1     |
|2018-10-08    |ABC1     |
|2018-10-11    |ABC2     |
|2018-10-16    |ABC1     |
|2018-10-19    |ABC2     |
|2018-10-20    |ABC1     |
|2018-10-22    |ABC2     |
|2018-10-30    |ABC1     |

table2

| Category     |DATE           |
|--------------|---------------|
|ABC1          |2018-10-30     |
|ABC2          |2018-10-24     |
|ABC1          |2018-10-23     |
|ABC2          |2018-10-21     |

Final result should look something like this

| Category     |DATE           |past_week  | past_2_weeks  |
|--------------|---------------|-----------|---------------|
|ABC1          |2018-10-30     |1          |3              |
|ABC2          |2018-10-24     |1          |1              |
|ABC1          |2018-10-23     |2          |2              |
|ABC2          |2018-10-21     |1          |1              |

I know this can be achieved in SQL using correlated inline subqueries, but Hive doesn't have that capability. Is there an optimized way of doing this?

NOTE The dataset is very large, for example, table1 has more than 500000 rows and table2 have about 20000 rows. There are many other categories available as well (~160). I only want to show you how the tables look like.

Upvotes: 1

Views: 36

Answers (1)

Chema
Chema

Reputation: 2828

You can do subqueries in Hive as well.

In your example I join two subqueries.

WITH t AS(
SELECT t2.category AS category, t2.date AS date, COUNT(*) AS past_week
FROM table_dos t2
JOIN table_uno t1 ON(t2.category = t1.category)
WHERE t2.date >= t1.date AND
      CEIL(DATEDIFF(to_date(t2.date),to_date(t1.date)) / 7) <= 1
GROUP BY t2.category, t2.date)
SELECT t.category AS category, t.date AS date ,t.past_week AS past_week, t2.past_2_weeks AS past_2_weeks
FROM t
JOIN (SELECT t2.category AS category, t2.date AS date, COUNT(*) AS past_2_weeks
FROM table_dos t2
JOIN table_uno t1 ON(t2.category = t1.category)
WHERE t2.date >= t1.date AND
      CEIL(DATEDIFF(to_date(t2.date),to_date(t1.date)) / 7) <= 2
GROUP BY t2.category, t2.date) AS t2
ON(t.category = t2.category)
WHERE t.date = t2.date
ORDER BY category, date DESC;

expected output with the given data

+-----------+-------------+------------+---------------+--+
| category  |    date     | past_week  | past_2_weeks  |
+-----------+-------------+------------+---------------+--+
| ABC1      | 2018-10-30  | 1          | 3             |
| ABC1      | 2018-10-23  | 2          | 2             |
| ABC2      | 2018-10-24  | 2          | 3             |
| ABC2      | 2018-10-21  | 1          | 2             |
+-----------+-------------+------------+---------------+--+

Upvotes: 1

Related Questions