Dolfina

Reputation: 33

pyspark - Dynamically select column content based on other column from the same row

My data frame looks like this:


categoryName  catA  catB
catA          0.25  0.75
catB          0.5   0.5

Here categoryName has String type and the cat* columns are Doubles. I would like to add a column that contains the value from the column whose name is stored in categoryName:


categoryName  catA  catB  score
catA          0.25  0.75  0.25
catB          0.5   0.7   0.7

In the first row, 'score' takes its value from column 'catA'; in the second row, from column 'catB'. Thank you.

Upvotes: 3

Views: 2149

Answers (1)

vladsiv

Reputation: 2946

One way is to create a map out of the column names and values for each row, and then access that map with the key stored in the categoryName column.

What's cool about this is that it can work for as many columns as you want.

Example:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

data = [
    {"categoryName": "catA", "catA": 0.25, "catB": 0.75},
    {"categoryName": "catB", "catA": 0.5, "catB": 0.7},
]

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(data)
df = (
    # Build a map column of {column name -> column value} over all columns
    df.withColumn(
        "map", F.expr("map(" + ",".join([f"'{c}', {c}" for c in df.columns]) + ")")
    )
    # Look the map up with the name stored in categoryName
    .withColumn("score", F.expr("map[categoryName]"))
    # The helper map column is no longer needed
    .drop("map")
)
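To see what the F.expr call actually receives, the string-building part can be checked on its own in plain Python (using the three column names Spark infers from the example dicts, in alphabetical order):

```python
# Columns as Spark infers them from the example dicts (alphabetical order).
cols = ["catA", "catB", "categoryName"]

# Same string-building as in the snippet above.
expr = "map(" + ",".join(f"'{c}', {c}" for c in cols) + ")"
print(expr)
# map('catA', catA,'catB', catB,'categoryName', categoryName)
```

Each column contributes a literal-name/value pair, which is exactly the argument list the SQL map() function expects.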

Result:

+----+----+------------+-----+                                                  
|catA|catB|categoryName|score|
+----+----+------------+-----+
|0.25|0.75|catA        |0.25 |
|0.5 |0.7 |catB        |0.7  |
+----+----+------------+-----+

Upvotes: 4
