mperic

Reputation: 23

Spark Scala update dataframe

I have a problem like this:

val data = Seq(("TIM", "FIRST", "A", 1),
               ("BIM", "SECOND", "A", 2),
               ("JIM", "THIRD", "B", 1)).toDF("NAME", "POSITION", "GROUP", "INDEX")

data.show()
data.printSchema()

val title = Seq(("A", "MASTER"), ("B", "TEACHER"),
                ("C", "STUDENT")).toDF("LETTER", "DEGREE")

title.show()
title.printSchema()

+----+--------+-----+-----+
|NAME|POSITION|GROUP|INDEX|
+----+--------+-----+-----+
| TIM|   FIRST|    A|    1|
| BIM|  SECOND|    A|    2|
| JIM|   THIRD|    B|    1|
+----+--------+-----+-----+

root
 |-- NAME: string (nullable = true)
 |-- POSITION: string (nullable = true)
 |-- GROUP: string (nullable = true)
 |-- INDEX: integer (nullable = false)

+------+-------+
|LETTER| DEGREE|
+------+-------+
|     A| MASTER|
|     B|TEACHER|
|     C|STUDENT|
+------+-------+

root
 |-- LETTER: string (nullable = true)
 |-- DEGREE: string (nullable = true)

//Final result
+----+--------+-------+-----+
|NAME|POSITION|  GROUP|INDEX|
+----+--------+-------+-----+
| TIM|   FIRST| MASTER|    1|
| BIM|  SECOND|      A|    2|
| JIM|   THIRD|TEACHER|    1|
+----+--------+-------+-----+


I tried several things:

val result = data.withColumn("GROUP", when('INDEX === 1, ???????????))

Where the question marks are, I tried calling a UDF, but I could not get the current row's GROUP value to pass to the UDF as a parameter. I also tried putting a select on title there, matching GROUP = LETTER, but nothing worked.

In production, the first dataframe is huge and the other is very small.

Is there an elegant way to do this without joining them first and then calling withColumn on the join?

Thank you

Upvotes: 0

Views: 52

Answers (1)

Raphael Roth

Reputation: 27373

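Use a broadcast join:

import org.apache.spark.sql.functions.{broadcast, when}

data
  .join(broadcast(title), $"GROUP" === $"LETTER")
  // replace GROUP with the matching DEGREE only where INDEX is 1
  .withColumn("GROUP", when($"INDEX" === 1, $"DEGREE").otherwise($"GROUP"))
  .drop("LETTER", "DEGREE")
  .show()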

+----+--------+-------+-----+
|NAME|POSITION|  GROUP|INDEX|
+----+--------+-------+-----+
| TIM|   FIRST| MASTER|    1|
| BIM|  SECOND|      A|    2|
| JIM|   THIRD|TEACHER|    1|
+----+--------+-------+-----+
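
One caveat: the join above is an inner join, so a row of data whose GROUP has no matching LETTER in title would be dropped from the result. If such rows can occur and should be kept unchanged, a left join may be closer to what is wanted. The following is only a sketch under that assumption; the object name LeftJoinSketch, the function replaceGroup, and the extra KIM row are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{broadcast, coalesce, col, when}

object LeftJoinSketch {
  // Hypothetical helper: same idea as the broadcast join, but with a left join
  // so that rows of `data` without a match in `title` are kept.
  def replaceGroup(data: DataFrame, title: DataFrame): DataFrame =
    data
      .join(broadcast(title), col("GROUP") === col("LETTER"), "left")
      // DEGREE is null when nothing matched; coalesce falls back to the old GROUP
      .withColumn("GROUP",
        when(col("INDEX") === 1, coalesce(col("DEGREE"), col("GROUP")))
          .otherwise(col("GROUP")))
      .drop("LETTER", "DEGREE")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("left-join-sketch")
      .getOrCreate()
    import spark.implicits._

    val data = Seq(("TIM", "FIRST", "A", 1),
                   ("BIM", "SECOND", "A", 2),
                   ("JIM", "THIRD", "B", 1),
                   ("KIM", "FOURTH", "Z", 1)) // assumed extra row, no title for "Z"
      .toDF("NAME", "POSITION", "GROUP", "INDEX")
    val title = Seq(("A", "MASTER"), ("B", "TEACHER"), ("C", "STUDENT"))
      .toDF("LETTER", "DEGREE")

    // KIM is kept and its GROUP stays "Z" because title has no LETTER "Z"
    replaceGroup(data, title).show()
    spark.stop()
  }
}
```

With the sample data from the question, every GROUP has a match, so both variants give the same rows.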

You could also collect title into a lookup map, broadcast this map, and use a UDF, but there is really no advantage over a broadcast join.
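
For completeness, the lookup-map alternative could look roughly like this. It is a minimal sketch, not code from this answer: the object name LookupUdfSketch, the function replaceGroup, and the choice to leave GROUP unchanged when no LETTER matches are all assumptions for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, udf, when}

object LookupUdfSketch {
  // Hypothetical helper: replaces GROUP via a broadcast Map instead of a join.
  def replaceGroup(spark: SparkSession, data: DataFrame, title: DataFrame): DataFrame = {
    // Collect the small table to the driver as LETTER -> DEGREE
    val lookup: Map[String, String] =
      title.select("LETTER", "DEGREE")
        .collect()
        .map(row => row.getString(0) -> row.getString(1))
        .toMap

    // Ship the map to the executors once
    val lookupBc = spark.sparkContext.broadcast(lookup)

    // Assumption: fall back to the original value if the letter is unknown
    val toDegree = udf((group: String) => lookupBc.value.getOrElse(group, group))

    data.withColumn("GROUP",
      when(col("INDEX") === 1, toDegree(col("GROUP"))).otherwise(col("GROUP")))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("lookup-udf-sketch")
      .getOrCreate()
    import spark.implicits._

    val data = Seq(("TIM", "FIRST", "A", 1),
                   ("BIM", "SECOND", "A", 2),
                   ("JIM", "THIRD", "B", 1))
      .toDF("NAME", "POSITION", "GROUP", "INDEX")
    val title = Seq(("A", "MASTER"), ("B", "TEACHER"), ("C", "STUDENT"))
      .toDF("LETTER", "DEGREE")

    replaceGroup(spark, data, title).show()
    spark.stop()
  }
}
```

This only makes sense while title comfortably fits in driver memory, since collect() pulls the whole table to the driver.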

Upvotes: 1
