3yakuya

Reputation: 2672

How to select a column based on value of another in Pyspark?

I have a dataframe in which some column special_column contains values like one and two. My dataframe also has columns one_processed and two_processed.

I would like to add a new column my_new_column whose values are taken from other columns of my dataframe, based on the value of special_column. For example, if special_column == one, I would like my_new_column to be set to the value of one_processed.

I tried .withColumn("my_new_column", F.col(F.concat(F.col("special_column"), F.lit("_processed")))), but Spark complains that F.col cannot be parametrized with a column.

How could I get the string value of the concatenation, so that I can select the desired column?
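For reference, a minimal runnable sketch of the failing attempt (the dataframe contents are assumed for illustration); F.col expects a string column name, not a Column expression, which is why the call is rejected:

from pyspark.sql import functions as F

df = spark.createDataFrame(
    [(1, 2, "one"), (3, 4, "two")],
    ["one_processed", "two_processed", "special_column"],
)

# Fails: F.col takes a string column name, not a Column expression
df = df.withColumn(
    "my_new_column",
    F.col(F.concat(F.col("special_column"), F.lit("_processed"))),
)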

Upvotes: 0

Views: 5712

Answers (2)

Mariusz
Mariusz

Reputation: 13936

The easiest way in your case would be a simple when/otherwise, like:

>>> from pyspark.sql import functions as F
>>> df = spark.createDataFrame([(1, 2, "one"), (1, 2, "two")], ["one_processed", "two_processed", "special_column"])
>>> df.withColumn("my_new_column", F.when(F.col("special_column") == "one", F.col("one_processed")).otherwise(F.col("two_processed"))).show()
+-------------+-------------+--------------+-------------+
|one_processed|two_processed|special_column|my_new_column|
+-------------+-------------+--------------+-------------+
|            1|            2|           one|            1|
|            1|            2|           two|            2|
+-------------+-------------+--------------+-------------+

As far as I know there is no way to look a column up by a name computed from the data itself, as the execution plan would then depend on the data.
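If there are more than two candidate columns, the same when chain can be built programmatically from a list of names. A minimal sketch, assuming the set of possible values of special_column is known up front and each value has a matching <name>_processed column:

from functools import reduce
from pyspark.sql import functions as F

names = ["one", "two"]  # assumed set of values special_column can take

# Seed the chain with the first candidate, then fold the rest in;
# Column.when can be chained onto an existing when expression.
expr = F.when(F.col("special_column") == names[0], F.col(names[0] + "_processed"))
expr = reduce(
    lambda acc, n: acc.when(F.col("special_column") == n, F.col(n + "_processed")),
    names[1:],
    expr,
)
df = df.withColumn("my_new_column", expr)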

Upvotes: 2

E.ZY.

Reputation: 725

from pyspark.sql.functions import when, col, lit, concat_ws
sdf.withColumn(
    "my_new_column",
    when(col("special_column") == "one", col("one_processed"))
    .otherwise(concat_ws("_", col("special_column"), lit("processed"))),
)
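Note that the otherwise branch here builds the literal string "<value>_processed" with concat_ws rather than reading the value of that column, so for the lookup the question asks about you would still need a when branch per candidate column, as in the other answer.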

Upvotes: 2
