Reputation: 2439
I have the following DataFrame:
+---------+---------+-----------+
| some_id | one_col | other_col |
+---------+---------+-----------+
| xx1     |      11 |       177 |
| xx2     |    1613 |      2000 |
| xx4     |       0 |     12473 |
+---------+---------+-----------+
I need to add a new column based on a calculation over the first and second columns, namely the percentage that one_col represents of other_col. For example, for one_col=1 and other_col=10, the new value would be (1/10)*100 = 10%:
+---------+---------+-----------+------------+
| some_id | one_col | other_col | percentage |
+---------+---------+-----------+------------+
| xx1     |      11 |       177 |        6.2 |
| xx3     |       1 |        10 |         10 |
| xx2     |    1613 |      2000 |       80.6 |
| xx4     |       0 |     12473 |          0 |
+---------+---------+-----------+------------+
I know I would need to use a udf for this, but how do I pass the column values to it so that the result becomes the new column?
Some pseudo-code:
import pyspark
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType  # needed for the return type below

df = load_my_df  # placeholder: however the DataFrame is loaded

def my_udf(val1, val2):
    # val1 as a percentage of val2
    return (val1 / val2) * 100

udf_percentage = udf(my_udf, FloatType())
df = df.withColumn('percentage', udf_percentage(# how?))
Thank you!
Upvotes: 1
Views: 10328
df.withColumn('percentage', udf_percentage("one_col", "other_col"))
or
df.withColumn('percentage', udf_percentage(df["one_col"], df["other_col"]))
or
df.withColumn('percentage', udf_percentage(df.one_col, df.other_col))
or
from pyspark.sql.functions import col
df.withColumn('percentage', udf_percentage(col("one_col"), col("other_col")))
but why not just:
df.withColumn('percentage', col("one_col") / col("other_col") * 100)
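For completeness, here is a minimal runnable sketch of that last, UDF-free approach (the SparkSession setup and sample rows are assumptions reconstructed from the question; round() is only added to match the one-decimal display above):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, round as spark_round

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Sample data taken from the question's table
df = spark.createDataFrame(
    [("xx1", 11, 177), ("xx2", 1613, 2000), ("xx4", 0, 12473)],
    ["some_id", "one_col", "other_col"],
)

# Plain column arithmetic stays inside the JVM, so Catalyst can optimize it;
# integer columns are promoted to double by the division, so no cast is needed.
df = df.withColumn(
    "percentage", spark_round(col("one_col") / col("other_col") * 100, 1)
)
df.show()

A Python UDF pays a per-row serialization cost between the JVM and the Python worker, which is the main reason to prefer the built-in column expression when one exists.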
Upvotes: 5