Reputation: 8638
I have this DataFrame df
with 3 columns: id
, type
and activity
.
val myData = (Seq(("aa1", "GROUP_A", "10"),("aa1","GROUP_A", "12"),("aa2","GROUP_A", "hy"),("aa2", "GROUP_B", "14"),
("aa3","GROUP_B", "11"),("aa3","GROUP_B","12" ),("aa2", "GROUP_3", "12"))
val df = sc.parallelize(myData).toDF()
I need to group data by type
and then calculate the number of activities for each id
. This is the expected result:
type id count
GROUP_A aa1 2
GROUP_A aa2 1
GROUP_B aa3 3
GROUP_B aa2 1
This is what I tried:
df.groupBy("type","id").count().sort("count").show()
However it does not give a correct result.
Upvotes: 2
Views: 72
Reputation: 41987
You can define the column names
when you create dataframe
and do the count on the grouped data
. This should be easy
import sqlContext.implicits._
val myData = Seq(("aa1", "GROUP_A", "10"),
("aa1","GROUP_A", "12"),
("aa2","GROUP_A", "hy"),
("aa2", "GROUP_B", "14"),
("aa3","GROUP_B", "11"),
("aa3","GROUP_B","12" ),
("aa3", "GROUP_B", "12"))
val df = sc.parallelize(myData).toDF("id", "type", "activity")
df.groupBy("type","id").count().sort("count").show()
Upvotes: 0
Reputation: 3110
I minimally changed your sample data and it works for me:
//yours
val myData = (Seq(("aa1", "GROUP_A", "10"),("aa1","GROUP_A", "12"),("aa2","GROUP_A", "hy"),("aa2", "GROUP_B", "14"),("aa3","GROUP_B", "11"),("aa3","GROUP_B","12" ),("aa2", "GROUP_3", "12"))
//mine
//removed the ( at the beginning
//changed GROUP_3 to GROUP_B
//other minor changes so that the resultant group by will look like you desired
val myData = Seq(("aa1", "GROUP_A", "10"),("aa1","GROUP_A", "12"),("aa2","GROUP_A", "12"),("aa3", "GROUP_B", "14"),("aa3","GROUP_B", "11"),("aa3","GROUP_B","12" ),("aa2", "GROUP_B", "12"))
//yours
val df = sc.parallelize(myData).toDF()
//mine
//added in column names
val df = sc.parallelize(myData).toDF("id","type","count")
df.groupBy("type","id").count.show
+-------+---+-----+
| type| id|count|
+-------+---+-----+
|GROUP_A|aa1| 2|
|GROUP_A|aa2| 1|
|GROUP_B|aa2| 1|
|GROUP_B|aa3| 3|
+-------+---+-----+
Is there something I missed?
Upvotes: 1