Reputation: 33
my data frame looks like:
categoryName | catA | catB |
---|---|---|
catA | 0.25 | 0.75 |
catB | 0.5 | 0.7 |
Where categoryName has String type and the cat* columns are Double. I would like to add a column that contains the value from the column whose name is stored in the categoryName column:
categoryName | catA | catB | score |
---|---|---|---|
catA | 0.25 | 0.75 | 0.25 |
catB | 0.5 | 0.7 | 0.7 |
In the first row, 'score' takes the value from column 'catA'; in the second row, 'score' takes the value from column 'catB'. Thank you
Upvotes: 3
Views: 2149
Reputation: 2946
One way is to build a map from column names to values for each row, and then look the map up with the value stored in the categoryName column.
What's nice about this is that it works for as many columns as you want: in the first row the map sends 'catA' to 0.25 and 'catB' to 0.75, so looking it up with categoryName = 'catA' yields 0.25.
Example:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

data = [
    {"categoryName": "catA", "catA": 0.25, "catB": 0.75},
    {"categoryName": "catB", "catA": 0.5, "catB": 0.7},
]

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(data)

df = (
    df.withColumn(
        # Build a map literal of the form map('catA', catA, 'catB', catB, ...)
        # covering every column in the DataFrame.
        "map", F.expr("map(" + ",".join([f"'{c}', {c}" for c in df.columns]) + ")")
    )
    # Look the map up with the value stored in categoryName.
    .withColumn("score", F.expr("map[categoryName]"))
    .drop("map")
)
Result:
+----+----+------------+-----+
|catA|catB|categoryName|score|
+----+----+------------+-----+
|0.25|0.75|catA |0.25 |
|0.5 |0.7 |catB |0.7 |
+----+----+------------+-----+
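As a variant sketch (not from the answer above), the same lookup can be built with F.create_map instead of assembling a SQL expression string; the names value_cols and lookup below are introduced only for illustration, and the pairs are restricted to the columns other than categoryName so the map values stay doubles:
from itertools import chain
import pyspark.sql.functions as F

# Hypothetical alternative: build the lookup map with create_map rather than a
# SQL string. Only the value columns (everything except categoryName) go into
# the map, so its values remain doubles.
value_cols = [c for c in df.columns if c != "categoryName"]
lookup = F.create_map(*chain.from_iterable([F.lit(c), F.col(c)] for c in value_cols))
df = df.withColumn("score", F.element_at(lookup, F.col("categoryName")))
Either way, the lookup is a single column expression, so it scales to any number of cat* columns without hard-coding them.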
Upvotes: 4