Reputation: 2938
I have a PySpark DataFrame, and I'd like to add a column computed from multiple values in the other columns. For instance, say I have a simple dataframe with the ages and names of people, and I want to compute some value like age*2 + len(name). Can I do this with a udf or with .withColumn?
from pyspark.sql import Row

l = [('Ankit', 25), ('Jalfaizy', 22), ('saurabh', 20), ('Bala', 26)]
rdd = sc.parallelize(l)  # sc is the SparkContext provided by the shell/notebook
people = rdd.map(lambda x: Row(name=x[0], age=int(x[1])))
schemaPeople = sqlContext.createDataFrame(people)
display(schemaPeople)  # display() is Databricks-specific; schemaPeople.show() works anywhere
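(For reference: sc and sqlContext come preconfigured in the pyspark shell and in Databricks notebooks. In a standalone script you would create them yourself; a minimal sketch, assuming Spark 2.0+, with an arbitrary app name:)

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("people-example").getOrCreate()
sc = spark.sparkContext                       # the old sc handle
schemaPeople = spark.createDataFrame(people)  # SparkSession replaces sqlContext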
Upvotes: 1
Views: 50
Reputation: 2938
I found a way to do this with @udf:
from pyspark.sql.functions import udf

@udf("int")  # declare the return type; without it, @udf defaults to string
def complex_op(age, name):
    return age * 2 + len(name)

schemaPeople.withColumn(
    "my_column",
    complex_op(schemaPeople["age"], schemaPeople["name"])  # no lit() needed: the udf call already returns a Column
).show()
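For larger data, a vectorized pandas UDF avoids the per-row Python overhead of a plain udf; a minimal sketch, assuming Spark 3.x with PyArrow installed (complex_op_vec is just an illustrative name):

import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("int")
def complex_op_vec(age: pd.Series, name: pd.Series) -> pd.Series:
    # operates on whole column batches at once instead of row by row
    return age * 2 + name.str.len()

schemaPeople.withColumn("my_column", complex_op_vec("age", "name")).show()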
Upvotes: 0
Reputation: 15318
Use withColumn with built-in column expressions; no udf is needed, and the computation stays entirely in the JVM:
from pyspark.sql import functions as F

schemaPeople.withColumn(
    "my_column",
    F.col("age") * 2 + F.length(F.col("name"))
).show()
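An equivalent variant writes the same expression as a SQL string via F.expr, which some find easier to read:

schemaPeople.withColumn("my_column", F.expr("age * 2 + length(name)")).show()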
Upvotes: 3