Reputation: 1831
log1
event1 foo_id1
event1 foo_id2
event1 foo_id3
event1 foo_id4
event1 foo_id1
event1 foo_id1
event1 foo_id2
event1 foo_id1
event1 foo_id1
event1 foo_id3
Log2
event2 foo_id1 od_id1
event2 foo_id1 od_id13
event2 foo_id1 od_id15
event2 foo_id2 od_id2
event2 foo_id2 od_id14
event2 foo_id3 od_id3
event2 foo_id5 od_id9
event2 foo_id8 od_id10
event2 foo_id7 od_id11
event2 foo_id6 od_id12
event2 foo_id1 od_id4
event2 foo_id3 od_id5
event3 foo_id1 od_id6
event3 foo_id2 od_id7
event3 foo_id3 od_id8
Expected Output
event2 od_id1 1
event2 od_id2 1
event2 od_id3 1
event2 od_id4 1
event2 od_id5 1
event2 0d_id14 1
event2 od_id13 1
event2 od_id15 1
All events(ie, event1, event2, event3) have foo_ids. For all events in "event1", I would like to know how many times those foo_id(s) occur in event2 group by od_id.
Upvotes: 1
Views: 100
Reputation: 6693
All you need is an "in clause" which you can use left semi join in Hive instead.
In your situation, you would create two tables:
create table log1 (id string, fooid string);
create table log2 (id string, fooid string, odid string);
and load your data into these tables.
The SQL command you need is
select id, odid, count(fooid)
from log2 left semi join log1 on (log1.fooid = log2.fooid)
where id = 'event2'
group by id, odid;
Upvotes: 2