Reputation: 4055
I have one table with three columns id, name, and date. Each day I get a dump of all the names and ids from an external process. I want to produce an output table where for each day I get a count of the unique number of names that occurred on that day that I have not seen before.
For instance, given the following data
id name d
1 Bob 2014-01-01
2 Suzy 2014-01-01
3 Jen 2014-01-02
4 Jen 2014-01-02
5 Bob 2014-01-02
6 Jon 2014-01-03
7 Mike 2014-01-03
8 Suzy 2014-01-03
I want a query to return
d name_count
2014-01-02 1
2014-01-03 2
My first thought was to use a correlated subquery like below but Hive doesn't support that
select c.d, count(distinct lower(c.name)) as name_count
from db c
where lower(c.name) not in (
select lower(p.name) as name
from db p
where p.d < c.d
group by lower(p.company_name)
)
group by d
I also though about using a left outer join, but I don't think this will give me what I want
select c.d, count(distinct lower(c.name)) as name
from db c
left outer join db p on lower(c.name) = lower(p.name)
where p.name is null and p.d < c.d
group by c.d;
because p.name will never be null as I'm self joining.
Do you guys know if my second query will work? If not can you please suggest an alternative.
Thanks
Upvotes: 1
Views: 1475
Reputation: 3845
A slightly different approach but a nested query should solve this:
Query:
select q.d, COUNT(*) from (select name, min(d) as d from table_name GROUP BY name)q GROUP BY q.d
Output:
2014-01-01 2
2014-01-02 1
2014-01-03 2
Upvotes: 2
Reputation: 1
The second query won't work, the date compare shall be within the join, as following:
select c.date, count(distinct c.name) as count_name
from db c
left join db p
on lower(c.name) = lower(p.name)
and p.date < c.date
where p.name is null
group by c.date
You can test it on http://sqlfiddle.com/ using the below schema:
CREATE TABLE db
(
id int,
name varchar(255),
date varchar(20)
);
insert into db values(1,'Bob', '2014-01-01');
insert into db values(2,'Suzy', '2014-01-01');
insert into db values(3,'Jen', '2014-01-02');
insert into db values(4,'Jen', '2014-01-02');
insert into db values(5,'Bob', '2014-01-02');
insert into db values(6,'Jon', '2014-01-03');
insert into db values(7,'Mike', '2014-01-03');
insert into db values(8,'Suzy', '2014-01-03');
Upvotes: 0