Be Chiller Too

Reputation: 2938

Computing one value from multiple values in a row

I have a PySpark Dataframe, and I'd like to add a column computed from multiple values from the other columns.

For instance, let's say I have a simple dataframe with ages and names of people, and I want to compute some value, like age*2 + len(name). Can I do this with a udf or a .withColumn?

from pyspark.sql import Row
l = [('Ankit',25),('Jalfaizy',22),('saurabh',20),('Bala',26)]
rdd = sc.parallelize(l)
people = rdd.map(lambda x: Row(name=x[0], age=int(x[1])))
schemaPeople = sqlContext.createDataFrame(people)
display(schemaPeople)
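For concreteness, here is the intended computation in plain Python over the same sample rows (no Spark involved, just to show the expected values of the new column):

```python
# Sample data from the question: (name, age) pairs
people = [('Ankit', 25), ('Jalfaizy', 22), ('saurabh', 20), ('Bala', 26)]

# The desired column: age*2 + len(name) for each row
expected = {name: age * 2 + len(name) for name, age in people}
print(expected)  # Ankit -> 25*2 + 5 = 55, and so on
```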

Upvotes: 1

Views: 50

Answers (2)

Be Chiller Too

Reputation: 2938

I found a way to do this with @udf:

from pyspark.sql.functions import udf

# Declare the return type: without it, @udf defaults to StringType,
# and the new column would come back as strings.
@udf("int")
def complex_op(age, name):
    return age * 2 + len(name)

# Calling the udf already returns a Column, so no lit() wrapper is needed
schemaPeople.withColumn(
    "my_column",
    complex_op(schemaPeople["age"], schemaPeople["name"])
)

Upvotes: 0

Steven

Reputation: 15318

Use withColumn:

from pyspark.sql import functions as F

schemaPeople.withColumn(
    "my_column",
    F.col("age")*2 + F.length(F.col("name"))
).show()

Upvotes: 3

Related Questions