Reputation: 1039
My file contains only three event types (the 3rd column): 01, 02, and 03.
The schema is unixTimestamp|id|eventType|date1|date2|date3
639393604950|1001|01|2015-05-12 10:00:18|||
639393604950|1002|01|2015-05-12 10:04:18|||
639393604950|1003|01|2015-05-12 10:05:18|||
639393604950|1001|02||2015-05-12 10:40:18||
639393604950|1001|03|||2015-05-12 19:30:18|
639393604950|1002|02|2015-05-12 10:04:18|||
In sqlContext, how do I merge the rows by id? This is what I expect for id 1001:
639393604950|1001|01|2015-05-12 10:00:18|2015-05-12 10:40:18|2015-05-12 19:30:18|
Here's my query that needs to be adjusted:
val events = sqlContext.sql("SELECT id, max(date1), max(date2), max(date3) " +
"FROM parquetFile group by id, date1, date2, date3")
events.collect().foreach(println)
Upvotes: 0
Views: 1116
Reputation: 1257
Given the way the data is generated, the schema in the file looks confusing. The problem is that all dates are populated in the date1 field, whatever the event type. Hence, we need to split date1 into separate columns by eventType.
select id, ts, max(d1), max(d2), max(d3)
from (select id, unixTimestamp as ts,
  case when eventType='01' then date1 else null end d1,
  case when eventType='02' then date1 else null end d2,
  case when eventType='03' then date1 else null end d3
  from table1
) x group by id, ts
Of course, this groups by id and ts together, which matches the expected output in the question.
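
To see what this pivot does, here is a minimal sketch that runs the same kind of query against an in-memory SQLite table from Python instead of Spark. The rows are hypothetical and follow the layout this answer assumes (every timestamp stored in date1, with eventType telling the events apart); the table name table1 and the alias ts come from the query above, and SQLite is only a stand-in, so Spark-specific behavior is not exercised.

```python
import sqlite3

# Hypothetical rows in the layout this answer assumes: the timestamp is
# always stored in date1 and eventType says which event it belongs to.
rows = [
    (639393604950, 1001, "01", "2015-05-12 10:00:18"),
    (639393604950, 1001, "02", "2015-05-12 10:40:18"),
    (639393604950, 1001, "03", "2015-05-12 19:30:18"),
    (639393604950, 1002, "01", "2015-05-12 10:04:18"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE table1 "
    "(unixTimestamp INTEGER, id INTEGER, eventType TEXT, date1 TEXT)"
)
conn.executemany("INSERT INTO table1 VALUES (?, ?, ?, ?)", rows)

# The inner query spreads date1 into one column per event type;
# max() then collapses each group, skipping the NULLs.
pivoted = conn.execute("""
    SELECT id, ts, max(d1), max(d2), max(d3)
    FROM (SELECT id, unixTimestamp AS ts,
                 CASE WHEN eventType = '01' THEN date1 END AS d1,
                 CASE WHEN eventType = '02' THEN date1 END AS d2,
                 CASE WHEN eventType = '03' THEN date1 END AS d3
          FROM table1) x
    GROUP BY id, ts
    ORDER BY id
""").fetchall()

for row in pivoted:
    print(row)
```

With these rows, id 1001 comes back as a single row carrying all three timestamps, while id 1002 keeps NULL for the events it never had.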
Upvotes: 0
Reputation: 9569
SELECT id, max(date1), max(date2), max(date3) FROM parquetFile group by id
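
Grouping by id alone works because max() skips NULLs, so each date column collapses to its one populated value per id. Grouping by the date columns as well, as in the question, leaves nearly every row in its own group. A rough way to check this without Spark is the sketch below, which loads the question's sample rows into an in-memory SQLite table from Python. It assumes the empty fields are NULL rather than empty strings, and it reuses the name parquetFile only for readability; SQLite stands in for Spark SQL here.

```python
import sqlite3

# Sample rows from the question; empty fields are assumed to be NULL.
rows = [
    (639393604950, 1001, "01", "2015-05-12 10:00:18", None, None),
    (639393604950, 1002, "01", "2015-05-12 10:04:18", None, None),
    (639393604950, 1003, "01", "2015-05-12 10:05:18", None, None),
    (639393604950, 1001, "02", None, "2015-05-12 10:40:18", None),
    (639393604950, 1001, "03", None, None, "2015-05-12 19:30:18"),
    (639393604950, 1002, "02", "2015-05-12 10:04:18", None, None),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE parquetFile (unixTimestamp INTEGER, id INTEGER, "
    "eventType TEXT, date1 TEXT, date2 TEXT, date3 TEXT)"
)
conn.executemany("INSERT INTO parquetFile VALUES (?, ?, ?, ?, ?, ?)", rows)

# Query from this answer: one group per id.
merged = conn.execute(
    "SELECT id, max(date1), max(date2), max(date3) "
    "FROM parquetFile GROUP BY id ORDER BY id"
).fetchall()

# Query from the question: grouping by the dates too keeps the rows apart.
original = conn.execute(
    "SELECT id, max(date1), max(date2), max(date3) "
    "FROM parquetFile GROUP BY id, date1, date2, date3"
).fetchall()

for row in merged:
    print(row)
print(len(merged), "rows when grouping by id;", len(original), "rows with the original GROUP BY")
```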
Upvotes: 2