Dolfina

Reputation: 33

pyspark - Dynamically select column content based on other column from the same row

My data frame looks like this:


categoryName  catA  catB
catA          0.25  0.75
catB          0.5   0.5

Here categoryName has String type and the cat* columns are Doubles. I would like to add a column that contains the value from the column whose name is stored in categoryName:


categoryName  catA  catB  score
catA          0.25  0.75  0.25
catB          0.5   0.7   0.7

In the first row, 'score' takes its value from column 'catA'; in the second row, from column 'catB'. Thank you.

Upvotes: 3

Views: 2149

Answers (1)

vladsiv

Reputation: 2946

One way is to create a map out of the column names and values for each row, and then access that map with the key stored in the categoryName column.

What's cool about this is that it can work for as many columns as you want.

Example:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

data = [
    {"categoryName": "catA", "catA": 0.25, "catB": 0.75},
    {"categoryName": "catB", "catA": 0.5, "catB": 0.7},
]

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(data)
df = (
    # Build a map column of {column name -> column value} over all columns
    df.withColumn(
        "map", F.expr("map(" + ",".join([f"'{c}', {c}" for c in df.columns]) + ")")
    )
    # Look the map up with the name stored in categoryName
    .withColumn("score", F.expr("map[categoryName]"))
    # The helper map column is no longer needed
    .drop("map")
)
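To see what the F.expr call actually receives, the string-building part can be checked on its own in plain Python (using the three column names Spark infers from the example dicts, in alphabetical order):

```python
# Columns as Spark infers them from the example dicts (alphabetical order).
cols = ["catA", "catB", "categoryName"]

# Same string-building as in the snippet above.
expr = "map(" + ",".join(f"'{c}', {c}" for c in cols) + ")"
print(expr)
# map('catA', catA,'catB', catB,'categoryName', categoryName)
```

Each column contributes a literal-name/value pair, which is exactly the argument list the SQL map() function expects.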

Result:

+----+----+------------+-----+                                                  
|catA|catB|categoryName|score|
+----+----+------------+-----+
|0.25|0.75|catA        |0.25 |
|0.5 |0.7 |catB        |0.7  |
+----+----+------------+-----+

Upvotes: 4
