Jon
Jon

Reputation: 4055

Hive self join based on previous date

I have one table with three columns id, name, and date. Each day I get a dump of all the names and ids from an external process. I want to produce an output table where for each day I get a count of the unique number of names that occurred on that day that I have not seen before.

For instance, given the following data

id    name    d
1    Bob    2014-01-01
2    Suzy   2014-01-01
3    Jen    2014-01-02
4    Jen    2014-01-02
5    Bob    2014-01-02
6    Jon    2014-01-03
7    Mike   2014-01-03
8    Suzy   2014-01-03

I want a query to return

d             name_count
2014-01-02    1
2014-01-03    2

My first thought was to use a correlated subquery like below but Hive doesn't support that

select c.d, count(distinct lower(c.name)) as name_count
from db c
where lower(c.name) not in (
  select lower(p.name) as name
  from db p
  where p.d < c.d
  group by lower(p.company_name)
)
group by d

I also though about using a left outer join, but I don't think this will give me what I want

select c.d, count(distinct lower(c.name)) as name
from db c
left outer join db p on lower(c.name) = lower(p.name)
where p.name is null and p.d < c.d
group by c.d;

because p.name will never be null as I'm self joining.

Do you guys know if my second query will work? If not can you please suggest an alternative.

Thanks

Upvotes: 1

Views: 1475

Answers (2)

Amar
Amar

Reputation: 3845

A slightly different approach but a nested query should solve this:

Query:

select q.d, COUNT(*) from (select name, min(d) as d from table_name GROUP BY name)q GROUP BY q.d

Output:

2014-01-01  2
2014-01-02  1
2014-01-03  2

Upvotes: 2

Bar Oster
Bar Oster

Reputation: 1

The second query won't work, the date compare shall be within the join, as following:

select c.date, count(distinct c.name) as count_name
from db c
left join db p 
on lower(c.name) = lower(p.name)
and p.date < c.date
where p.name is null
group by c.date

You can test it on http://sqlfiddle.com/ using the below schema:

CREATE TABLE db
(
id int,
name varchar(255),
date varchar(20) 
);

insert into db values(1,'Bob', '2014-01-01');
insert into db values(2,'Suzy', '2014-01-01');
insert into db values(3,'Jen', '2014-01-02');
insert into db values(4,'Jen', '2014-01-02');
insert into db values(5,'Bob', '2014-01-02');
insert into db values(6,'Jon', '2014-01-03');
insert into db values(7,'Mike', '2014-01-03');
insert into db values(8,'Suzy', '2014-01-03');

Upvotes: 0

Related Questions