Reputation: 580
Consider an example dataframe:
df =
+-------+-----+
| tech|state|
+-------+-----+
| 70|wa |
| 50|mn |
| 20|fl |
| 50|mo |
| 10|ar |
| 90|wi |
| 30|al |
| 50|ca |
+-------+-----+
I want to change the 'tech' column such that any value of 50 is changed to 1 and all other values are equal to 0.
The output would look like this:
df =
+-------+-----+
| tech|state|
+-------+-----+
| 0|wa |
| 1|mn |
| 0|fl |
| 1|mo |
| 0|ar |
| 0|wi |
| 0|al |
| 1|ca |
+-------+-----+
Here's what I have so far:
from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import IntegerType

changing_column = 'tech'
udf_first = UserDefinedFunction(lambda x: 1, IntegerType())
udf_second = UserDefinedFunction(lambda x: 0, IntegerType())
first_df = df.select(*[udf_first(changing_column) if column == 50 else column for column in df.columns])
second_df = first_df.select(*[udf_second(changing_column) if column != 50 else column for column in first_df.columns])
second_df.show()
Upvotes: 0
Views: 502
Reputation: 2767
Hope this helps. Use `when`/`otherwise` with `withColumn` so the `tech` column is replaced in place rather than added as a duplicate:
from pyspark.sql.functions import when

df = spark.createDataFrame(
    [(70, 'wa'),
     (50, 'mn'),
     (20, 'fl')],
    ["tech", "state"])

df.withColumn("tech", when(df.tech == 50, 1).otherwise(0)).show()
+----+-----+
|tech|state|
+----+-----+
|   0|   wa|
|   1|   mn|
|   0|   fl|
+----+-----+
Upvotes: 1