earl

Reputation: 768

Call a function for each row of a dataframe in pyspark [non-pandas]

There is a function to be used with PySpark:

def sum(a,b):
    c=a+b
    return c

It has to be run on each record of a very large dataframe using Spark SQL:

x = sum(df.select("NUM1").first()["NUM1"], df.select("NUM2").first()["NUM2"])

But this would run it only for the first record of the df, not for all rows. I understand it could be done using a lambda, but I am not able to code it in the desired way.

In reality, c would be a dataframe and the function would be doing a lot of spark.sql work before returning it, and I would have to call that function for each row. I am hoping to work it out using this sum(a, b) as an analogy.

Input:

+----------+----------+-----------+
|     NUM1 |     NUM2 |    XYZ    |
+----------+----------+-----------+
|      10  |     20   |      HELLO|
|      90  |     60   |      WORLD|
|      50  |     45   |      SPARK|
+----------+----------+-----------+

Desired output:

+----------+----------+-----------+------+
|     NUM1 |     NUM2 |    XYZ    | VALUE|
+----------+----------+-----------+------+
|      10  |     20   |      HELLO|30    |
|      90  |     60   |      WORLD|150   |
|      50  |     45   |      SPARK|95    |
+----------+----------+-----------+------+
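Conceptually, the per-row application I am describing would look something like the rough sketch below, using the RDD API and a lambda (this is only an illustration of the intent; result is an illustrative name, and the real function is far more involved and returns a dataframe):

# Illustrative sketch only: apply the sum(a, b) defined above to every row
# via the RDD API and a lambda, then rebuild a dataframe with the extra VALUE column.
result = df.rdd.map(
    lambda row: (row["NUM1"], row["NUM2"], row["XYZ"], sum(row["NUM1"], row["NUM2"]))
).toDF(["NUM1", "NUM2", "XYZ", "VALUE"])
result.show()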

Python: 3.7.4
Spark: 2.2

Upvotes: 4

Views: 17149

Answers (3)

Manjunatha H C

Reputation: 247

Use the simple approach below:

1. Import pyspark.sql functions

from pyspark.sql import functions as F

2. Use the F.expr() function

df = df.withColumn("VALUE", F.expr("NUM1 + NUM2"))
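A minimal end-to-end sketch of the same idea, assuming a DataFrame df with the NUM1, NUM2 and XYZ columns from the question; selectExpr is simply an equivalent spelling of the same expression-based approach:

from pyspark.sql import functions as F

# Assumes df already contains the NUM1, NUM2 and XYZ columns from the question.
df_with_value = df.withColumn("VALUE", F.expr("NUM1 + NUM2"))

# Equivalent via selectExpr: keep every existing column and add the computed one.
df_with_value = df.selectExpr("*", "NUM1 + NUM2 AS VALUE")

df_with_value.show()

Both run the addition row by row inside Spark SQL, without collecting anything to the driver.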

Upvotes: 0

Naidu_DPN

Reputation: 41

We can do it in the ways below. Note that when registering a UDF, the third argument (the return type) is not mandatory; a variant that passes it explicitly is sketched after these examples.

from pyspark.sql import functions as F
df1 = spark.createDataFrame([(10,20,'HELLO'),(90,60,'WORLD'),(50,45,'SPARK')],['NUM1','NUM2','XYZ'])
df1.show()
df2=df1.withColumn('VALUE',F.expr('NUM1 + NUM2'))
df2.show(3,False)
+----+----+-----+-----+
|NUM1|NUM2|XYZ  |VALUE|
+----+----+-----+-----+
|10  |20  |HELLO|30   |
|90  |60  |WORLD|150  |
|50  |45  |SPARK|95   |
+----+----+-----+-----+


(or)

def sum(c1,c2):
    return c1+c2
spark.udf.register('sum_udf1',sum)
df2=df1.withColumn('VALUE',F.expr("sum_udf1(NUM1,NUM2)"))
df2.show(3,False)
+----+----+-----+-----+
|NUM1|NUM2|XYZ  |VALUE|
+----+----+-----+-----+
|10  |20  |HELLO|30   |
|90  |60  |WORLD|150  |
|50  |45  |SPARK|95   |
+----+----+-----+-----+

(or)

sum_udf2=F.udf(lambda x,y: x+y)
df2=df1.withColumn('VALUE',sum_udf2('NUM1','NUM2'))
df2.show(3,False)
+----+----+-----+-----+
|NUM1|NUM2|XYZ  |VALUE|
+----+----+-----+-----+
|10  |20  |HELLO|30   |
|90  |60  |WORLD|150  |
|50  |45  |SPARK|95   |
+----+----+-----+-----+
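For reference, here is a sketch of the same registration with the optional third argument (the return type) supplied explicitly; sum_udf_typed is just an illustrative name, and df1 is the DataFrame created above:

from pyspark.sql import functions as F
from pyspark.sql.types import LongType

def sum(c1, c2):
    return c1 + c2

# Same registration as above, but with the optional return type passed explicitly.
spark.udf.register('sum_udf_typed', sum, LongType())
df2 = df1.withColumn('VALUE', F.expr("sum_udf_typed(NUM1, NUM2)"))
df2.show(3, False)

Without the return type, the registered UDF's result comes back as a string column (the default); with LongType() the VALUE column is a proper long.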

Upvotes: 2

Vincent

Reputation: 591

You can use the .withColumn function:

from pyspark.sql.functions import col
from pyspark.sql.types import LongType
df.show()
+----+----+-----+
|NUM1|NUM2|  XYZ|
+----+----+-----+
|  10|  20|HELLO|
|  90|  60|WORLD|
|  50|  45|SPARK|
+----+----+-----+

def mysum(a, b):
  return a + b

# Registering makes mysum callable by name ("mysumudf") from SQL expression strings;
# the direct call below does not need it, since adding two Columns already builds a Column expression.
spark.udf.register("mysumudf", mysum, LongType())

df2 = df.withColumn("VALUE", mysum(col("NUM1"), col("NUM2")))

df2.show()
+----+----+-----+-----+
|NUM1|NUM2|  XYZ|VALUE|
+----+----+-----+-----+
|  10|  20|HELLO|   30|
|  90|  60|WORLD|  150|
|  50|  45|SPARK|   95|
+----+----+-----+-----+
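The snippet above registers mysumudf but then calls mysum directly, which works because adding two Column objects already yields a Column expression. Here is a hedged sketch of how the registered name itself would be used, either inside F.expr or through a temporary view (the view name tmp is illustrative):

from pyspark.sql import functions as F

# Use the registered UDF by name inside a SQL expression string...
df2 = df.withColumn("VALUE", F.expr("mysumudf(NUM1, NUM2)"))

# ...or through a temporary view and a full SQL query.
df.createOrReplaceTempView("tmp")
df2 = spark.sql("SELECT NUM1, NUM2, XYZ, mysumudf(NUM1, NUM2) AS VALUE FROM tmp")
df2.show()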

Upvotes: 3
