Reputation: 662
I'm having some trouble with groupByKey in Scala and Spark.
I have two case classes:
case class Employee(id_employee: Long, name_emp: String, salary: String)
At the moment, the second case class looks like this:
case class Company(id_company: Long, employee:Seq[Employee])
However, I want to replace it with this new one:
case class Company(id_company: Long, name_comp: String, employee: Seq[Employee])
There is a parent Dataset (df1) that I use with groupByKey to create Company objects:
val companies = df1
  .groupByKey(v => v.id_company)
  .mapGroups { case (k, iter) =>
    Company(k, iter.map(x => Employee(x.id_employee, x.name_emp, x.salary)).toSeq)
  }
  .collect()
This code works; it returns objects like this one:
Company(1234,List(Employee(0987, John, 30000),Employee(4567, Bob, 50000)))
However, I can't figure out how to add the company's name_comp to those objects (this field exists in df1), so that I get objects like this one, using the new case class:
Company(1234, NYTimes, List(Employee(0987, John, 30000),Employee(4567, Bob, 50000)))
Upvotes: 1
Views: 608
Reputation: 28332
Since you want both the company id and name, you can use a tuple as the key when grouping your data. Both values are then readily available when constructing the Company class:
df1
  .groupByKey(v => (v.id_company, v.name_comp))
  .mapGroups { case ((id, name), iter) =>
    Company(id, name, iter.map(x => Employee(x.id_employee, x.name_emp, x.salary)).toSeq)
  }
  .collect()
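To see what the tuple key does without starting Spark, here is a rough sketch of the same grouping on plain Scala collections, where groupBy and map stand in for the Dataset's groupByKey and mapGroups. The Record case class, the TupleKeySketch object and the toCompanies helper are hypothetical names made up for this illustration; the question does not show the element type of df1, so its fields are an assumption.

```scala
// Hypothetical row type standing in for the elements of df1 (not shown in the question).
case class Record(id_company: Long, name_comp: String, id_employee: Long, name_emp: String, salary: String)
case class Employee(id_employee: Long, name_emp: String, salary: String)
case class Company(id_company: Long, name_comp: String, employee: Seq[Employee])

object TupleKeySketch {
  // Group by the (id, name) pair so that both values are in the key
  // and can be pattern-matched out when building each Company.
  def toCompanies(rows: Seq[Record]): Seq[Company] =
    rows
      .groupBy(r => (r.id_company, r.name_comp))
      .map { case ((id, name), group) =>
        Company(id, name, group.map(r => Employee(r.id_employee, r.name_emp, r.salary)).toList)
      }
      .toSeq

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      Record(1234L, "NYTimes", 987L, "John", "30000"),
      Record(1234L, "NYTimes", 4567L, "Bob", "50000")
    )
    toCompanies(rows).foreach(println)
  }
}
```

One thing to keep in mind with a composite key: if the same id_company ever shows up with two different name_comp values, those rows land in separate groups and you get two Company objects for that id.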
Upvotes: 2