Darshan Mehta
Darshan Mehta

Reputation: 30819

Spark : Pivot with multiple columns

I have the following json file that I am reading in Spark:

{"id" : "123", "category" : "A"}
{"id" : "456", "category" : "B"}
{"id" : "567", "category" : "C"}
{"id" : "678", "category" : "D"}
{"id" : "789", "category" : "E"}
{"id" : "123", "category" : "B"}
{"id" : "123", "category" : "C"}

I want to pivot it to generate mxn matrix of categories. Below is my code:

val df = spark.read.json("test.json")
val pivots = df.groupBy("category").pivot("category").count()
pivots.show()

This generates the following output:

+--------+----+----+----+----+----+
|category|   A|   B|   C|   D|   E|
+--------+----+----+----+----+----+
|       E|null|null|null|null|   1|
|       B|null|   2|null|null|null|
|       D|null|null|null|   1|null|
|       C|null|null|   2|null|null|
|       A|   1|null|null|null|null|
+--------+----+----+----+----+----+

What I really want to do is, to pivot by id and show the counts in this matrix. E.g. id '123' is mapped to categories A, B and C, so I want the above matrix to show 1 value for A-A, A-B and A-C (and vice versa), currently, it's only showing 1 value for A-A.

I changed the code to group by 'id' but that results in id values in first column whereas I want categories.

Is there any way I can do this (maybe using a completely different approach)?

Upvotes: 3

Views: 5044

Answers (1)

zero323
zero323

Reputation: 330093

Fist rename the columns and apply self-join:

val leftRight = df
  .withColumnRenamed("category", "left")
  .join(df.withColumnRenamed("category", "right"), Seq("id"))

to get co-occurrences for each id. Next apply crosstab:

leftRight.stat.crosstab("left", "right")

to aggregate data across all ids. The result is:

+----------+---+---+---+---+---+
|left_right|  A|  B|  C|  D|  E|
+----------+---+---+---+---+---+
|         E|  0|  0|  0|  0|  1|
|         A|  1|  1|  1|  0|  0|
|         B|  1|  2|  1|  0|  0|
|         C|  1|  1|  2|  0|  0|
|         D|  0|  0|  0|  1|  0|
+----------+---+---+---+---+---+

Upvotes: 3

Related Questions