Carrot

Reputation: 431

Update pyspark dataframe columns from a column holding the target column names

I have a dataframe with a column ('target_column' in this case) that holds, per row, the name of another column, and I need to update that named column with the 'val' column's value.

I have tried using UDFs and .withColumn, but they all expect a fixed target column name, whereas in my case it varies per row. Using RDD map transformations didn't work either, as RDDs are immutable.

from pyspark.sql import SparkSession


def test():
    # Each row's target_column names the column to overwrite with val
    data = [("jose_1", "mase", "firstname", "jane"), ("li_1", "ken", "lastname", "keno"), ("liz_1", "durn", "firstname", "liz")]
    source_df = spark.createDataFrame(data, ["firstname", "lastname", "target_column", "val"])
    source_df.show()


if __name__ == "__main__":
    spark = SparkSession.builder.appName('Name Group').getOrCreate()
    test()
    spark.stop()

Input:

+---------+--------+-------------+----+
|firstname|lastname|target_column| val|
+---------+--------+-------------+----+
|   jose_1|    mase|    firstname|jane|
|     li_1|     ken|     lastname|keno|
|    liz_1|    durn|    firstname| liz|
+---------+--------+-------------+----+

Expected output:

+---------+--------+-------------+----+
|firstname|lastname|target_column| val|
+---------+--------+-------------+----+
|     jane|    mase|    firstname|jane|
|     li_1|    keno|     lastname|keno|
|      liz|    durn|    firstname| liz|
+---------+--------+-------------+----+

For example, in the first row of the input the target_column is 'firstname' and val is 'jane', so I need to update firstname to 'jane' in that row.

Thanks

Upvotes: 0

Views: 621

Answers (1)

Steven

Reputation: 15258

You can loop over all your columns:

from pyspark.sql import functions as F

# For each column, take 'val' on rows where target_column names
# that column; otherwise keep the column's original value.
for col in df.columns:
    df = df.withColumn(
        col,
        F.when(
            F.col("target_column") == F.lit(col),
            F.col("val")
        ).otherwise(F.col(col))
    )
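
For example, applied to the source_df from the question (a quick sketch assuming df = source_df before the loop and the same SparkSession), this produces the expected output:

df = source_df
for col in df.columns:
    df = df.withColumn(
        col,
        F.when(F.col("target_column") == F.lit(col), F.col("val")).otherwise(F.col(col))
    )
df.show()
# +---------+--------+-------------+----+
# |firstname|lastname|target_column| val|
# +---------+--------+-------------+----+
# |     jane|    mase|    firstname|jane|
# |     li_1|    keno|     lastname|keno|
# |      liz|    durn|    firstname| liz|
# +---------+--------+-------------+----+

Note that the loop also visits target_column and val themselves, but since no row's target_column contains the strings 'target_column' or 'val' in this data, the when condition never matches there and those columns are left unchanged.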

Upvotes: 1
