Nabi Shaikh

Reputation: 860

Equivalent of the pandas apply function in PySpark using a UDF

df = spark.createDataFrame(
    [
        (1, "AxtTR"),  # create your data here, be consistent in the types
        (2, "HdyOP"),
        (3, "EqoPIC"),
        (4, "OkTEic"),
    ],
    ["id", "label"],  # add your column names here
)
df.show()

The code below is in Python, where I use the pandas apply function to extract the first two letters of every row. I want to replicate the same logic in PySpark: apply a function to every row and collect the output.

def get_string(lst):
    lst = str(lst)
    lst = lst.lower()  # note the parentheses: lower is a method call
    lst = lst[0:2]
    return lst

df['first_2letter'] = df['label'].apply(get_string)
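For reference, the corrected pandas version runs end to end like this (a self-contained sketch that recreates the question's data in pandas):

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3, 4],
                   "label": ["AxtTR", "HdyOP", "EqoPIC", "OkTEic"]})

def get_string(lst):
    lst = str(lst)
    lst = lst.lower()  # method call, not just the attribute
    return lst[0:2]

# apply calls get_string once per row of the Series
df["first_2letter"] = df["label"].apply(get_string)
print(df["first_2letter"].tolist())  # ['ax', 'hd', 'eq', 'ok']
```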

The column highlighted in yellow in the image below is the expected output: the lowercased first two letters of each label.

[Image: Expected output and data frame]

Upvotes: 0

Views: 395

Answers (1)

mck

Reputation: 42422

You can use the relevant Spark SQL functions:

import pyspark.sql.functions as F

# Slicing a Column translates to substr, so [0:2] takes the first two characters
df2 = df.withColumn('first_2letter', F.lower('label')[0:2])

df2.show()
+---+------+-------------+
| id| label|first_2letter|
+---+------+-------------+
|  1| AxtTR|           ax|
|  2| HdyOP|           hd|
|  3|EqoPIC|           eq|
|  4|OkTEic|           ok|
+---+------+-------------+

If you want to use user-defined functions, you can define them as:

def get_string(lst):    
    lst = str(lst)
    lst = lst.lower()
    lst = lst[0:2]
    return lst

import pyspark.sql.functions as F

# F.udf wraps the Python function; the default return type is StringType
df2 = df.withColumn('first_2letter', F.udf(get_string)('label'))

df2.show()
+---+------+-------------+
| id| label|first_2letter|
+---+------+-------------+
|  1| AxtTR|           ax|
|  2| HdyOP|           hd|
|  3|EqoPIC|           eq|
|  4|OkTEic|           ok|
+---+------+-------------+
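If performance matters, a vectorized pandas UDF avoids per-row Python calls by operating on whole pandas Series at a time. A minimal sketch, assuming Spark 2.3+ with pandas and PyArrow installed (`first_two_letters` is a hypothetical helper name, not from the question):

```python
import pandas as pd

def first_two_letters(s: pd.Series) -> pd.Series:
    # Vectorized: lowercase the whole Series, then slice the first two characters
    return s.str.lower().str[:2]

# Wrapping and applying it requires a running SparkSession:
# import pyspark.sql.functions as F
# first_two = F.pandas_udf(first_two_letters, "string")
# df2 = df.withColumn('first_2letter', first_two('label'))
```

The helper itself is plain pandas, so it can be unit-tested without Spark before being wrapped with `F.pandas_udf`.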

Upvotes: 1
