iluvex

Reputation: 115

How to update a row based on another row with the same id

With a Spark DataFrame, I want to update a row's value based on other rows with the same id.

For example, I have the records below:

id,value
1,10
1,null
1,null
2,20
2,null
2,null

I want to get the result below:

id,value
1,10
1,10
1,10
2,20
2,20
2,20

To summarize: the value column is null in some rows, and I want to fill those in whenever another row with the same id has a valid value.

In SQL, I can simply write an UPDATE statement with an inner join, but I couldn't find an equivalent in Spark SQL.

update combineCols a inner join combineCols b on a.id = b.id set a.value = b.value (this is how I do it in SQL)
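
For reference, a sketch of roughly what the DataFrame-API equivalent of that join could look like, assuming the data is already loaded in a DataFrame df with the columns above (the filled and filled_value names are made up; DataFrames are immutable, so instead of an UPDATE we build a new DataFrame):

from pyspark.sql import functions as F

# keep one non-null lookup value per id
filled = (df.where(F.col('value').isNotNull())
            .select('id', F.col('value').alias('filled_value'))
            .dropDuplicates(['id']))

# left join so ids with no non-null value keep their null
result = (df.join(filled, on='id', how='left')
            .select('id', F.col('filled_value').alias('value')))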

Upvotes: 1

Views: 1357

Answers (2)

cph_sto

Reputation: 7587

Let's use the SQL method to solve this issue:

# create the example DataFrame
myValues = [(1,10),(1,None),(1,None),(2,20),(2,None),(2,None)]
df = sqlContext.createDataFrame(myValues, ['id','value'])

# register it as a SQL view and fill value per id with a window aggregate
df.registerTempTable('table_view')
df1 = sqlContext.sql(
    'select id, sum(value) over (partition by id) as value from table_view'
)
df1.show()
+---+-----+
| id|value|
+---+-----+
|  1|   10|
|  1|   10|
|  1|   10|
|  2|   20|
|  2|   20|
|  2|   20|
+---+-----+

Caveat: this code assumes that there is only one non-null value for any particular id. Since we aggregate over each partition, we have to use an aggregation function, and I have used sum. If there are two non-null values for an id, they will be summed up. If an id could have multiple non-null values, it's better to use min/max, so that we get one of the actual values rather than their sum.

df1 = sqlContext.sql(
    'select id, max(value) over (partition by id) as value from table_view'
)
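
To make the caveat concrete, here is a sketch with a hypothetical id carrying two non-null values (table_view2 is a made-up name), showing how sum and max diverge:

# hypothetical data: id 3 has two non-null values, 10 and 30
myValues2 = [(3,10),(3,30),(3,None)]
df2 = sqlContext.createDataFrame(myValues2, ['id','value'])
df2.registerTempTable('table_view2')

sqlContext.sql(
    'select id, sum(value) over (partition by id) as s, '
    'max(value) over (partition by id) as m from table_view2'
).show()
# every row gets s = 40 (the sum), but m = 30 (one of the real values)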

Upvotes: 1

hamza tuna

Reputation: 1497

You can use a window function to do this (in PySpark):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# create dataframe
df = sc.parallelize([
    [1,10],
    [1,None],
    [1,None],
    [2,20],
    [2,None],
    [2,None],
]).toDF(('id', 'value'))

# sort descending so the non-null value comes first in each id's
# partition (Spark puts nulls last when sorting descending)
window = Window.partitionBy('id').orderBy(F.desc('value'))
df \
    .withColumn('value', F.first('value').over(window)) \
    .show()

Results:

+---+-----+
| id|value|
+---+-----+
|  1|   10|
|  1|   10|
|  1|   10|
|  2|   20|
|  2|   20|
|  2|   20|
+---+-----+
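
If you would rather not rely on the descending sort pushing nulls last, a variant of the same idea (a sketch reusing df from above; window2 is a made-up name) is to pass ignorenulls to first over an unordered window, whose frame then spans the whole partition:

# take the first non-null value anywhere in each id's partition
window2 = Window.partitionBy('id')
df.withColumn('value', F.first('value', ignorenulls=True).over(window2)).show()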

You can use the same functions in Scala.

Upvotes: 0
