sophie

Reputation: 1039

Scala: Spark sqlContext query

My file contains only three event types (3rd column): 01, 02, and 03.

The schema is unixTimestamp|id|eventType|date1|date2|date3:

639393604950|1001|01|2015-05-12 10:00:18|||
639393604950|1002|01|2015-05-12 10:04:18|||
639393604950|1003|01|2015-05-12 10:05:18|||
639393604950|1001|02||2015-05-12 10:40:18||
639393604950|1001|03|||2015-05-12 19:30:18|
639393604950|1002|02|2015-05-12 10:04:18|||

In sqlContext, how do I merge the data by id? This is what I expect for id 1001:

639393604950|1001|01|2015-05-12 10:00:18|2015-05-12 10:40:18|2015-05-12 19:30:18|

Here's my query, which needs to be adjusted:

val events = sqlContext.sql("SELECT id, max(date1), max(date2), max(date3) " +
  "FROM parquetFile group by id, date1, date2, date3")
events.collect().foreach(println)

Upvotes: 0

Views: 1116

Answers (2)

ayan guha

Reputation: 1257

Given the way the data is generated, the schema in the file is confusing: all the dates are populated in the date1 field, just with different event types. Hence, we need to fix that first.

select id, ts, max(d1), max(d2), max(d3)
  from (select id, ts,
               case when eventtype = '01' then date1 else null end d1,
               case when eventtype = '02' then date1 else null end d2,
               case when eventtype = '03' then date1 else null end d3
          from table1) x
 group by id, ts

Of course, this groups by id and ts together, as expected in the desired output.
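
A minimal sketch of how this might be wired up in Spark (a 1.x-style sqlContext, as in the question). The path events.parquet and the table name table1 are placeholders, and unixTimestamp from the stated schema is aliased to ts to match the query above:

// Hypothetical path; register the data under the table name the query uses.
val raw = sqlContext.read.parquet("events.parquet")
raw.registerTempTable("table1")

// Pivot each event type's date1 into its own column, then collapse by id and ts.
val merged = sqlContext.sql("""
  select id, ts, max(d1) date1, max(d2) date2, max(d3) date3
    from (select id, unixTimestamp as ts,
                 case when eventType = '01' then date1 else null end d1,
                 case when eventType = '02' then date1 else null end d2,
                 case when eventType = '03' then date1 else null end d3
            from table1) x
   group by id, ts
""")
merged.collect().foreach(println)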

Upvotes: 0

kostya

Reputation: 9569

SELECT id, max(date1), max(date2), max(date3) FROM parquetFile GROUP BY id
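
A minimal sketch of this query plugged into the snippet from the question, assuming the data is already registered as a temp table named parquetFile, as the original query implies:

val events = sqlContext.sql(
  "SELECT id, max(date1), max(date2), max(date3) " +
  "FROM parquetFile GROUP BY id")
events.collect().foreach(println)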

Upvotes: 2
