I have the following JSON file that I am reading in Spark:
{"id" : "123", "category" : "A"}
{"id" : "456", "category" : "B"}
{"id" : "567", "category" : "C"}
{"id" : "678", "category" : "D"}
{"id" : "789", "category" : "E"}
{"id" : "123", "category" : "B"}
{"id" : "123", "category" : "C"}
I want to pivot it to generate an m×n matrix of categories. Below is my code:
val df = spark.read.json("test.json")
val pivots = df.groupBy("category").pivot("category").count()
pivots.show()
This generates the following output:
+--------+----+----+----+----+----+
|category| A| B| C| D| E|
+--------+----+----+----+----+----+
| E|null|null|null|null| 1|
| B|null| 2|null|null|null|
| D|null|null|null| 1|null|
| C|null|null| 2|null|null|
| A| 1|null|null|null|null|
+--------+----+----+----+----+----+
What I really want to do is pivot by id and show the counts in this matrix. E.g. id '123' is mapped to categories A, B and C, so I want the matrix above to show a count of 1 for A-A, A-B and A-C (and vice versa); currently it only shows 1 for A-A.
I changed the code to group by 'id', but that results in id values in the first column, whereas I want categories.
Is there any way I can do this (maybe using a completely different approach)?
Upvotes: 3
Views: 5044
First rename the columns and apply a self-join:
val leftRight = df
.withColumnRenamed("category", "left")
.join(df.withColumnRenamed("category", "right"), Seq("id"))
to get the co-occurrences for each id. Next apply crosstab:
leftRight.stat.crosstab("left", "right")
to aggregate data across all ids. The result is:
+----------+---+---+---+---+---+
|left_right| A| B| C| D| E|
+----------+---+---+---+---+---+
| E| 0| 0| 0| 0| 1|
| A| 1| 1| 1| 0| 0|
| B| 1| 2| 1| 0| 0|
| C| 1| 1| 2| 0| 0|
| D| 0| 0| 0| 1| 0|
+----------+---+---+---+---+---+
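To see why the counts come out this way, here is a minimal plain-Scala sketch (no Spark) of the same self-join-then-crosstab logic on the sample data; the object and value names are illustrative, not part of any Spark API:

```scala
object CooccurrenceSketch {
  // The sample (id, category) rows from the question.
  val rows: Seq[(String, String)] = Seq(
    ("123", "A"), ("456", "B"), ("567", "C"), ("678", "D"),
    ("789", "E"), ("123", "B"), ("123", "C"))

  // Self-join on id: every (left, right) category pair that shares an id.
  val pairs: Seq[(String, String)] = for {
    (id1, left)  <- rows
    (id2, right) <- rows
    if id1 == id2
  } yield (left, right)

  // Crosstab: count how often each (left, right) pair occurs.
  val crosstab: Map[(String, String), Int] =
    pairs.groupBy(identity).map { case (pair, hits) => (pair, hits.size) }

  def main(args: Array[String]): Unit =
    crosstab.toSeq.sortBy(_._1).foreach { case ((l, r), n) =>
      println(s"$l-$r: $n")
    }
}
```

For example, id '123' carries categories A, B and C, so it contributes one (A, B) pair, while B-B appears twice (once from id '123', once from id '456'), matching the table above.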
Upvotes: 3