oyvindhauge

Reputation: 3693

Update value in column based on other column values in Spark

I want to set the value of a column in a Spark DataFrame based on the values of an arbitrary number of other columns in the row.

I realise I can do it like this:

df.withColumn("IsValid", when($"col1" === $"col2" && $"col3" === $"col4", true).otherwise(false))

But there has to be a better way of doing this for data frames with 20+ columns.

The row contains an even number of columns that should be checked pairwise to determine whether the "IsValid" column is true or false.

Upvotes: 0

Views: 1159

Answers (2)

blackbishop

Reputation: 32710

Another way is to group the columns pairwise and construct the condition for the when function:

import org.apache.spark.sql.functions.{col, when}

val condition = df.columns
  .grouped(2)
  .map { case Array(a, b) => col(a) === col(b) }
  .reduce(_ and _)

val df1 = df.withColumn("IsValid", when(condition, true).otherwise(false))
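A minimal sketch of how the grouped/reduce step pairs the columns, using plain strings instead of Spark Column objects so it runs without a Spark session (the column names are made up for illustration):

```scala
// Stand-in for df.columns; in the real DataFrame these would be
// the actual column names, in pair order.
val columns = Array("col1", "col2", "col3", "col4")

// grouped(2) yields two-element chunks: Array(col1, col2), Array(col3, col4).
// Each chunk becomes one equality check; reduce joins them with AND.
val condition = columns
  .grouped(2)
  .map { case Array(a, b) => s"($a = $b)" }
  .reduce((l, r) => s"$l AND $r")

println(condition) // (col1 = col2) AND (col3 = col4)
```

With Spark, the map step would produce col(a) === col(b) instead of a string, and reduce would combine the Column expressions with `and`.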

Upvotes: 1

mck

Reputation: 42422

You can map and reduce the list of columns into the condition you want:

import org.apache.spark.sql.functions.{col, when}

val cond = (0 until df.columns.length by 2)
  .map(i => col(df.columns(i)) === col(df.columns(i + 1)))
  .reduce(_ && _)

df.withColumn("IsValid", when(cond, true).otherwise(false))
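The index-stepping in this answer can be sketched without Spark: step through the positions two at a time, pairing index i with i + 1 (the column names below are invented for the example):

```scala
// Stand-in for df.columns on a DataFrame with three column pairs.
val columns = Array("a1", "a2", "b1", "b2", "c1", "c2")

// 0 until columns.length by 2 visits 0, 2, 4; each index i is
// paired with i + 1, mirroring col(columns(i)) === col(columns(i + 1)).
val pairs = (0 until columns.length by 2)
  .map(i => (columns(i), columns(i + 1)))

println(pairs.mkString(", ")) // (a1,a2), (b1,b2), (c1,c2)
```

Like the other answer, each pair would become an equality Column in Spark, reduced with `&&` into a single condition.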

Upvotes: 1
