Reputation: 2033
I have two tables in Hive, t1 and t2:
> describe t1;
date_id string
> describe t2;
messageid string,
createddate string,
userid int
> select * from t1 limit 3;
2011-01-01 00:00:00
2011-01-02 00:00:00
2011-01-03 00:00:00
> select * from t2 limit 3;
87211389 2011-01-03 23:57:01 13864753
87211656 2011-01-03 23:57:59 13864769
87211746 2011-01-03 23:58:25 13864785
What I want is to count the distinct userid values over the previous three days for a given date.
For example, for date 2011-01-03, I want to count distinct userid from 2011-01-01 to 2011-01-03.
For date 2011-01-04, I want to count distinct userid from 2011-01-02 to 2011-01-04.
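To pin down the expected result before writing any HiveQL, here is a minimal reference sketch in plain Python. The sample rows are hypothetical (only the column layout comes from t2), and slicing the first ten characters of createddate stands in for to_date(). It is only meant as a way to check a query's output on a small sample, not as a replacement for the query.

```python
from datetime import date, timedelta

# Hypothetical sample rows shaped like t2: (messageid, createddate, userid).
T2_ROWS = [
    ("m1", "2011-01-01 08:00:00", 1),
    ("m2", "2011-01-02 09:30:00", 2),
    ("m3", "2011-01-03 23:57:01", 1),
    ("m4", "2011-01-03 23:57:59", 3),
    ("m5", "2011-01-04 10:00:00", 4),
]


def distinct_users_in_window(rows, day, window_days=3):
    """Count distinct userids created in the window_days days ending on day (inclusive)."""
    end = date.fromisoformat(day)
    start = end - timedelta(days=window_days - 1)
    users = set()
    for _messageid, createddate, userid in rows:
        created = date.fromisoformat(createddate[:10])  # same role as to_date()
        if start <= created <= end:
            users.add(userid)
    return len(users)
```

With these sample rows, 2011-01-03 covers users 1, 2 and 3 (count 3), and 2011-01-04 covers users 1, 2, 3 and 4 (count 4).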
I wrote the following query, but it does not return the three-day result. It returns the distinct userid count per day instead.
SELECT to_date(t1.date_id), count(distinct t2.userid) FROM t1 JOIN t2
ON (to_date(t2.createddate) = to_date(t1.date_id))
WHERE date_sub(to_date(t2.createddate),0) > date_sub(to_date(t1.date_id), 3)
AND to_date(t2.createddate) <= to_date(t1.date_id)
GROUP by to_date(t1.date_id);
`to_date()` and `date_sub()` are date functions in Hive.
That said, the following part does not take effect:
WHERE date_sub(to_date(t2.createddate),0) > date_sub(to_date(t1.date_id), 3)
AND to_date(t2.createddate) <= to_date(t1.date_id)
EDIT: One solution can be (but it is super slow):
SELECT to_date(t3.date_id), count(distinct t3.userid) FROM
(
SELECT * FROM t1 LEFT OUTER JOIN t2
WHERE
(date_sub(to_date(t2.createddate),0) > date_sub(to_date(t1.date_id), 3)
AND to_date(t2.createddate) <= to_date(t1.date_id)
)
) t3
GROUP by to_date(t3.date_id);
UPDATE: Thanks for all the answers. They are good.
But Hive is a bit different from SQL, and unfortunately they cannot be used in Hive.
My current solution is to use UNION ALL.
SELECT * FROM t1 JOIN t2 ON (to_date(t1.date_id) = to_date(t2.createddate))
UNION ALL
SELECT * FROM t1 JOIN t2 ON (to_date(t1.date_id) = date_add(to_date(t2.createddate), 1))
UNION ALL
SELECT * FROM t1 JOIN t2 ON (to_date(t1.date_id) = date_add(to_date(t2.createddate), 2))
Then I do a group by and count. In this way, I can get what I want.
Although it is not elegant, it is much more efficient than a cross join.
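One way to see why three equality joins can beat a cross join is the following Python sketch. It is an analogy under stated assumptions, not a description of how Hive executes the query: t2 is indexed by calendar day once, and each date in t1 then does three keyed lookups (offsets 0, 1 and 2, mirroring the three UNION ALL branches) instead of being compared with every row. The function name and sample rows are hypothetical.

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical sample data shaped like t1 and t2.
T1_DATES = ["2011-01-03 00:00:00", "2011-01-04 00:00:00"]
T2_ROWS = [
    ("m1", "2011-01-01 08:00:00", 1),
    ("m2", "2011-01-02 09:30:00", 2),
    ("m3", "2011-01-03 23:57:01", 1),
    ("m4", "2011-01-03 23:57:59", 3),
    ("m5", "2011-01-04 10:00:00", 4),
]


def three_day_counts(t1_dates, t2_rows, window_days=3):
    """Distinct userids per date, using one keyed lookup per day offset."""
    # Index t2 by calendar day once, like the build side of an equality join.
    users_by_day = defaultdict(set)
    for _messageid, createddate, userid in t2_rows:
        users_by_day[createddate[:10]].add(userid)

    counts = {}
    for date_id in t1_dates:
        day = date.fromisoformat(date_id[:10])
        users = set()
        # Offsets 0, 1, 2 play the role of the three UNION ALL branches.
        for offset in range(window_days):
            key = (day - timedelta(days=offset)).isoformat()
            users |= users_by_day.get(key, set())
        counts[day.isoformat()] = len(users)
    return counts
```

Taking the union of the per-day sets before counting matters: user 1 appears on two different days in the sample but is counted once, which is what count(distinct ...) after the UNION ALL does as well.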
Upvotes: 4
Views: 24934
Reputation: 86798
The following should work in standard SQL...
SELECT
to_date(t1.date_id),
count(distinct t2.userid)
FROM
t1
LEFT JOIN
t2
ON to_date(t2.createddate) >= date_sub(to_date(t1.date_id), 2)
AND to_date(t2.createddate) < date_add(to_date(t1.date_id), 1)
GROUP BY
to_date(t1.date_id)
It will, however, be slow, because you are storing dates as strings and then using to_date() to convert them to dates. This means that indexes can't be used, and the SQL engine can't do anything clever to reduce the effort being expended.
As a result, every possible combination of rows needs to be compared. If you have 100 entries in T1 and 10,000 entries in T2, your SQL engine is processing a million combinations.
If you store these values as dates, you don't need to_date(). And if you index the dates, the SQL engine can quickly home in on the range of dates being specified.
NOTE: The format of the ON clause means that you do not need to round t2.createddate down to a daily value.
EDIT Why your code didn't work...
SELECT to_date(t1.date_id), count(distinct t2.userid) FROM t1 JOIN t2
ON (to_date(t2.createddate) = to_date(t1.date_id))
WHERE date_sub(to_date(t2.createddate),0) > date_sub(to_date(t1.date_id), 3)
AND to_date(t2.createddate) <= to_date(t1.date_id)
GROUP by to_date(t1.date_id);
This joins t1 to t2 with an ON clause of (to_date(t2.createddate) = to_date(t1.date_id)). As the join is an inner JOIN, every joined row must have a t2.createddate on the same day as t1.date_id (with a LEFT OUTER JOIN, t2.createddate could also be NULL where nothing matches).
The WHERE clause allows a much wider range (3 days), but the ON clause of the JOIN has already restricted your data down to a single day.
The example I gave above simply takes your WHERE clause and puts it in place of the old ON clause.
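The effect is easy to reproduce on a small sample. The sketch below uses SQLite through Python's sqlite3 module rather than Hive, so it is only an illustration: SQLite's date(x) and date(x, '-3 days') stand in for to_date() and date_sub(), and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (date_id TEXT);
    CREATE TABLE t2 (messageid TEXT, createddate TEXT, userid INTEGER);
    INSERT INTO t1 VALUES ('2011-01-03 00:00:00'), ('2011-01-04 00:00:00');
    INSERT INTO t2 VALUES
        ('m1', '2011-01-01 08:00:00', 1),
        ('m2', '2011-01-02 09:30:00', 2),
        ('m3', '2011-01-03 23:57:01', 1),
        ('m4', '2011-01-03 23:57:59', 3),
        ('m5', '2011-01-04 10:00:00', 4);
""")

# The three-day window condition, shared by both queries.
WINDOW = """
    date(t2.createddate) > date(t1.date_id, '-3 days')
    AND date(t2.createddate) <= date(t1.date_id)
"""

# Equality in ON, window in WHERE: the join has already cut each date to one day.
equality_on = conn.execute(f"""
    SELECT date(t1.date_id), COUNT(DISTINCT t2.userid)
    FROM t1 JOIN t2 ON date(t2.createddate) = date(t1.date_id)
    WHERE {WINDOW}
    GROUP BY date(t1.date_id) ORDER BY 1
""").fetchall()

# Window moved into ON: each date is joined to all three days.
range_on = conn.execute(f"""
    SELECT date(t1.date_id), COUNT(DISTINCT t2.userid)
    FROM t1 JOIN t2 ON {WINDOW}
    GROUP BY date(t1.date_id) ORDER BY 1
""").fetchall()

print(equality_on)  # [('2011-01-03', 2), ('2011-01-04', 1)]
print(range_on)     # [('2011-01-03', 3), ('2011-01-04', 4)]
```

The first query returns per-day counts even though its WHERE clause spans three days; the second returns the three-day counts.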
EDIT
Hive doesn't allow <= and >= in the ON clause? Are you really tied to using Hive?
If you really are, what about BETWEEN?
SELECT
to_date(t1.date_id),
count(distinct t2.userid)
FROM
t1
LEFT JOIN
t2
ON to_date(t2.createddate) BETWEEN date_sub(to_date(t1.date_id), 2) AND to_date(t1.date_id) -- BETWEEN is inclusive at both ends
GROUP BY
to_date(t1.date_id)
Alternatively, refactor your table of dates to enumerate the dates you want to include...
TABLE t1 (calendar_date, inclusive_date) =
{ 2011-01-03, 2011-01-01
2011-01-03, 2011-01-02
2011-01-03, 2011-01-03
2011-01-04, 2011-01-02
2011-01-04, 2011-01-03
2011-01-04, 2011-01-04
2011-01-05, 2011-01-03
2011-01-05, 2011-01-04
2011-01-05, 2011-01-05 }
SELECT
to_date(t1.calendar_date),
count(distinct t2.userid)
FROM
t1
LEFT JOIN
t2
ON to_date(t2.createddate) = to_date(t1.inclusive_date)
GROUP BY
to_date(t1.calendar_date)
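If the enumerated date table has to be generated rather than typed by hand, a small script can produce the pairs. This is a sketch only: window_pairs is a hypothetical helper name, and how the rows are loaded into Hive afterwards is left open.

```python
from datetime import date, timedelta


def window_pairs(first_day, last_day, window_days=3):
    """Yield (calendar_date, inclusive_date) pairs as ISO date strings."""
    day = date.fromisoformat(first_day)
    end = date.fromisoformat(last_day)
    while day <= end:
        # Oldest day of the window first, the calendar date itself last.
        for offset in range(window_days - 1, -1, -1):
            yield day.isoformat(), (day - timedelta(days=offset)).isoformat()
        day += timedelta(days=1)


if __name__ == "__main__":
    for calendar_date, inclusive_date in window_pairs("2011-01-03", "2011-01-05"):
        print(calendar_date, inclusive_date)
```

For the range 2011-01-03 to 2011-01-05 this yields the same nine rows as the table listed above, in the same order.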
Upvotes: 11
Reputation: 1851
I am assuming that t1 is used to define the 3-day period. I suspect the puzzling approach is due to Hive's shortcomings. This allows you to have an arbitrary number of 3-day periods. Try the following two queries:
SELECT substring(t1.date_id,1,10), count(distinct t2.userid)
FROM t1
JOIN t2
ON substring(t2.createddate,1,10) >= date_sub(substring(t1.date_id,1,10), 2)
AND substring(t2.createddate,1,10) <= substring(t1.date_id,1,10)
GROUP BY t1.date_id
--or--
SELECT substring(t1.date_id,1,10), count(distinct t2.userid)
FROM t1
JOIN t2
ON t2.createddate like concat(substring(t1.date_id, 1, 10), '%')
OR t2.createddate like concat(substring(date_sub(t1.date_id, 1), 1, 10), '%')
OR t2.createddate like concat(substring(date_sub(t1.date_id, 2), 1, 10), '%')
GROUP BY t1.date_id
The latter minimizes the function calls on the t2 table. I am also assuming that t1 is the smaller of the two. substring should return the same result as to_date. According to the documentation, https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DateFunctions, to_date returns a string data type. Support for date data types seems minimal, but I am not familiar with Hive.
Upvotes: 2
Reputation: 5594
You need a subquery.
Try something like this (I cannot test it because I don't have Hive):
SELECT to_date(t1.date_id), count(distinct t2.userid) FROM t1 JOIN t2
ON (to_date(t2.createddate) = to_date(t1.date_id))
WHERE t2.messageid in
(
select t2.messageid from t2 where
date_sub(to_date(t2.createddate),0) > date_sub(to_date(t1.date_id), 3)
AND
to_date(t2.createddate) <= to_date(t1.date_id)
)
GROUP by to_date(t1.date_id);
The key is that, with the subquery, the right records in t2 are selected for each date in t1.
EDIT:
To force the subquery into the FROM clause, you could try this:
SELECT to_date(t1.date_id), count(distinct t2.userid) FROM t1 JOIN
(select userid, createddate from t2 where
date_sub(to_date(t2.createddate),0) > date_sub(to_date(t1.date_id), 3)
AND
to_date(t2.createddate) <= to_date(t1.date_id)
) as t2
ON (to_date(t2.createddate) = to_date(t1.date_id))
GROUP by to_date(t1.date_id);
But I don't know whether it would work.
Upvotes: 3