iluvex

Reputation: 115

How to update a row based on another row with the same id

With a Spark DataFrame, I want to update a row's value based on other rows with the same id.

For example, I have the records below:

id,value
1,10
1,null
1,null
2,20
2,null
2,null

I want to get the result below:

id,value
1,10
1,10
1,10
2,20
2,20
2,20

To summarize: the value column is null in some rows, and I want to fill those in whenever another row with the same id has a valid value.

In SQL, I can simply write an UPDATE statement with an inner join, but I couldn't find an equivalent in Spark SQL.

update combineCols a inner join combineCols b on a.id = b.id set a.value = b.value (this is how I do it in SQL)
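
For reference, a sketch of roughly what the DataFrame-API equivalent of that join could look like, assuming the data is already loaded in a DataFrame df with the columns above (the filled and filled_value names are made up; DataFrames are immutable, so instead of an UPDATE we build a new DataFrame):

from pyspark.sql import functions as F

# keep one non-null lookup value per id
filled = (df.where(F.col('value').isNotNull())
            .select('id', F.col('value').alias('filled_value'))
            .dropDuplicates(['id']))

# left join so ids with no non-null value keep their null
result = (df.join(filled, on='id', how='left')
            .select('id', F.col('filled_value').alias('value')))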

Upvotes: 1

Views: 1357

Answers (2)

cph_sto

Reputation: 7587

Let's use the SQL method to solve this issue:

# create the example DataFrame
myValues = [(1,10),(1,None),(1,None),(2,20),(2,None),(2,None)]
df = sqlContext.createDataFrame(myValues, ['id','value'])

# register it as a SQL view and fill value per id with a window aggregate
df.registerTempTable('table_view')
df1 = sqlContext.sql(
    'select id, sum(value) over (partition by id) as value from table_view'
)
df1.show()
+---+-----+
| id|value|
+---+-----+
|  1|   10|
|  1|   10|
|  1|   10|
|  2|   20|
|  2|   20|
|  2|   20|
+---+-----+

Caveat: this code assumes that there is only one non-null value for any particular id. Since we aggregate over each partition, we have to use an aggregation function, and I have used sum. If there are two non-null values for an id, they will be summed up. If an id could have multiple non-null values, it's better to use min/max, so that we get one of the actual values rather than their sum.

df1 = sqlContext.sql(
    'select id, max(value) over (partition by id) as value from table_view'
)
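
To make the caveat concrete, here is a sketch with a hypothetical id carrying two non-null values (table_view2 is a made-up name), showing how sum and max diverge:

# hypothetical data: id 3 has two non-null values, 10 and 30
myValues2 = [(3,10),(3,30),(3,None)]
df2 = sqlContext.createDataFrame(myValues2, ['id','value'])
df2.registerTempTable('table_view2')

sqlContext.sql(
    'select id, sum(value) over (partition by id) as s, '
    'max(value) over (partition by id) as m from table_view2'
).show()
# every row gets s = 40 (the sum), but m = 30 (one of the real values)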

Upvotes: 1

hamza tuna

Reputation: 1497

You can use a window function to do this (in PySpark):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# create dataframe
df = sc.parallelize([
    [1,10],
    [1,None],
    [1,None],
    [2,20],
    [2,None],
    [2,None],
]).toDF(('id', 'value'))

# sort descending so the non-null value comes first in each id's
# partition (Spark puts nulls last when sorting descending)
window = Window.partitionBy('id').orderBy(F.desc('value'))
df \
    .withColumn('value', F.first('value').over(window)) \
    .show()

Results:

+---+-----+
| id|value|
+---+-----+
|  1|   10|
|  1|   10|
|  1|   10|
|  2|   20|
|  2|   20|
|  2|   20|
+---+-----+
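
If you would rather not rely on the descending sort pushing nulls last, a variant of the same idea (a sketch reusing df from above; window2 is a made-up name) is to pass ignorenulls to first over an unordered window, whose frame then spans the whole partition:

# take the first non-null value anywhere in each id's partition
window2 = Window.partitionBy('id')
df.withColumn('value', F.first('value', ignorenulls=True).over(window2)).show()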

You can use the same functions in Scala.

Upvotes: 0
