Be Chiller Too

Reputation: 2938

Computing one value from multiple values in a row

I have a PySpark Dataframe, and I'd like to add a column computed from multiple values from the other columns.

For instance, let's say I have a simple dataframe with ages and names of people, and I want to compute some value, like age*2 + len(name). Can I do this with a udf or a .withColumn?

from pyspark.sql import Row
l = [('Ankit',25),('Jalfaizy',22),('saurabh',20),('Bala',26)]
rdd = sc.parallelize(l)
people = rdd.map(lambda x: Row(name=x[0], age=int(x[1])))
schemaPeople = sqlContext.createDataFrame(people)
display(schemaPeople)
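For concreteness, here is the intended computation in plain Python over the same sample rows (no Spark involved, just to show the expected values of the new column):

```python
# Sample data from the question: (name, age) pairs
people = [('Ankit', 25), ('Jalfaizy', 22), ('saurabh', 20), ('Bala', 26)]

# The desired column: age*2 + len(name) for each row
expected = {name: age * 2 + len(name) for name, age in people}
print(expected)  # Ankit -> 25*2 + 5 = 55, and so on
```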

Upvotes: 1

Views: 50

Answers (2)

Be Chiller Too

Reputation: 2938

I found a way to do this with @udf:

from pyspark.sql.functions import udf

# Declare the return type: without it, @udf defaults to StringType,
# and the new column would come back as strings.
@udf("int")
def complex_op(age, name):
    return age * 2 + len(name)

# Calling the udf already returns a Column, so no lit() wrapper is needed
schemaPeople.withColumn(
    "my_column",
    complex_op(schemaPeople["age"], schemaPeople["name"])
)

Upvotes: 0

Steven

Reputation: 15318

Use withColumn:

from pyspark.sql import functions as F

schemaPeople.withColumn(
    "my_column",
    F.col("age")*2 + F.length(F.col("name"))
).show()

Upvotes: 3

Related Questions