Reputation: 13
I have a database table containing user ids and the items each user clicked, e.g.:
user id, item id
1, 345
1, 78993
1, 784
5, 345
5, 897
15, 454
and I want to transform this data into the following format using Spark SQL (in Scala, if possible):
user id, item ids
1, 345, 78993, 784
5, 345, 897
15, 454
Thanks,
Upvotes: 1
Views: 1015
Reputation: 10428
A local example:
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{SparkConf, SparkContext}

// Define the case class at the top level so Spark's reflection-based
// schema inference can see it.
case class Record(user: Int, item: Int)

object Main extends App {
  val items = List(
    Record(1, 345),
    Record(1, 78993),
    Record(1, 784),
    Record(5, 345),
    Record(5, 897),
    Record(15, 454)
  )

  val sc = new SparkContext(new SparkConf().setAppName("test").setMaster("local"))
  // HiveContext is needed here because collect_set is a Hive UDAF in Spark 1.x.
  val hiveContext = new HiveContext(sc)
  import hiveContext.implicits._
  import hiveContext.sql

  val df = sc.parallelize(items).toDF()
  df.registerTempTable("records")

  sql("SELECT * FROM records").collect().foreach(println)
  // collect_set gathers all items per user into a single, deduplicated set.
  sql("SELECT user, collect_set(item) FROM records GROUP BY user").collect().foreach(println)
}
This produces:
[1,ArrayBuffer(78993, 784, 345)]
[5,ArrayBuffer(897, 345)]
[15,ArrayBuffer(454)]
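On Spark 1.6 or later, the same aggregation can also be written with the DataFrame API rather than a raw SQL string. A minimal sketch, assuming the df defined above and import org.apache.spark.sql.functions._:

// Group by user and collect the clicked items into a single array column.
df.groupBy("user")
  .agg(collect_set("item").as("items"))
  .collect()
  .foreach(println)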
Upvotes: 1
Reputation: 67075
This is a pretty simple groupByKey scenario. Although if you want to do something else with the result afterwards, I would suggest using a more efficient PairRDDFunctions method (such as aggregateByKey), since groupByKey is inefficient for follow-up queries.
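For illustration, a minimal sketch of the RDD-based approach on the same sample data (variable names here are hypothetical, and sc is an existing SparkContext):

// Pair RDD of (user, item) tuples.
val pairs = sc.parallelize(Seq((1, 345), (1, 78993), (1, 784), (5, 345), (5, 897), (15, 454)))

// groupByKey shuffles every value across the cluster; fine for a one-off grouping.
pairs.groupByKey().collect().foreach { case (user, items) =>
  println(s"$user, ${items.mkString(", ")}")
}

// If more aggregation follows, aggregateByKey combines values map-side
// before the shuffle, which is what makes it the more efficient choice.
val itemSets = pairs.aggregateByKey(Set.empty[Int])(_ + _, _ ++ _)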
Upvotes: 0