Reputation: 3
I need to create a UDF in PySpark that converts letter grades ('A', 'B', 'C', 'D', 'F') to numerical grades (4, 3, 2, 1, and 0), and then register this function as a Spark UDF. Next, I have a dataframe 'current_gpa' with a column named 'grade'. I need to add a column to current_gpa called 'num_grade' where the letter grades in the column 'grade' are converted to the corresponding numbers.
This is the UDF I created:
def get_num(letter):
    letter_class_dict = {"A": 1, "B": 2, "C": 3, "D": 4, "F": 5}
    for letter, l in letter_class_dict():
        x['letter'] = l
    return l

get_num = udf(lambda letter: letter_class_dict.get(letter))
get_num_udf = F.udf(get_num, IntegerType())
This is the dataframe current_gpa:
+-------+-------+------+----+-----+-------+
| course|term_id| sid| fid|grade|credits|
+-------+-------+------+----+-----+-------+
|BIO 101| 2000B|100001|1007| F| 3|
|BIO 102| 2000B|100001|1007| F| 4|
|CHM 101| 2000B|100001|1002| F| 4|
|BIO 103| 2000B|100001|1007| F| 4|
|GEN 114| 2000B|100001|1006| F| 3|
+-------+-------+------+----+-----+-------+
I'm trying to use this UDF to add a column 'num_grade' where the values should look like:
+-------+-------+------+----+-----+-------+----------+
| course|term_id| sid| fid|grade|credits|num_grades|
+-------+-------+------+----+-----+-------+----------+
|BIO 101| 2000B|100001|1007| F| 3| 0|
|BIO 102| 2000B|100001|1007| F| 4| 0|
|CHM 101| 2000B|100001|1002| F| 4| 0|
|BIO 103| 2000B|100001|1007| F| 4| 0|
|GEN 114| 2000B|100001|1006| F| 3| 0|
+-------+-------+------+----+-----+-------+----------+
current_gpa = (
    grades
    .join(courses, 'course')
    .select('course', 'term_id', 'sid', 'fid', 'grade', 'credits')
    .withColumn('num_grade', get_num_udf(col('grade')))
)
current_gpa.show()
This gives me the error: "An exception was thrown from a UDF: 'RuntimeError: SparkContext should only be created and accessed on the driver.'" Full traceback below:
Upvotes: 0
Views: 355
Reputation: 3
Here is how I ended up creating the UDF to convert letter grades to numbers:
def convert_grades(letter):
    letter_grades = {
        'A': 4,
        'B': 3,
        'C': 2,
        'D': 1,
        'F': 0
    }
    return letter_grades.get(letter)

grade_points = spark.udf.register('convert_grades', convert_grades)
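spark.udf.register returns the registered function, so grade_points can be applied directly in the DataFrame API (the registered name 'convert_grades' is also usable in SQL expressions). A minimal sketch of applying it to the current_gpa dataframe from the question; note that without an explicit return type the resulting column will contain strings:
from pyspark.sql.functions import col

# apply the registered UDF to the 'grade' column to build 'num_grade'
current_gpa = current_gpa.withColumn('num_grade', grade_points(col('grade')))
current_gpa.show()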
Upvotes: 0
Reputation: 6654
You don't need a UDF for this operation, and you should always try to avoid UDFs (unless absolutely necessary), as Spark is unable to optimize them, which can lead to performance deterioration.
This is a simple case-when (when().otherwise()) operation that can be built from the dictionary items using a list comprehension (an equivalent form is shown after the output below) or Python's native map function.
from pyspark.sql import functions as func

letter_class_dict = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}

# create an individual case-when statement for each letter/number pair
letter_class_casewhens = map(lambda a: func.when(func.col('grade') == a[0], func.lit(a[1])),
                             letter_class_dict.items())
# [Column<'CASE WHEN (grade = A) THEN 4 END'>,
# Column<'CASE WHEN (grade = B) THEN 3 END'>,
# Column<'CASE WHEN (grade = C) THEN 2 END'>,
# Column<'CASE WHEN (grade = D) THEN 1 END'>,
# Column<'CASE WHEN (grade = F) THEN 0 END'>]
# pass the case when statements in a `coalesce` function
data_sdf. \
    withColumn('num_grades', func.coalesce(*letter_class_casewhens)). \
    show()
# +-------+-------+------+----+-----+-------+----------+
# | course|term_id| sid| fid|grade|credits|num_grades|
# +-------+-------+------+----+-----+-------+----------+
# |BIO 101| 2000B|100001|1007| F| 3| 0|
# |BIO 102| 2000B|100001|1007| F| 4| 0|
# |CHM 101| 2000B|100001|1002| F| 4| 0|
# |BIO 103| 2000B|100001|1007| F| 4| 0|
# |GEN 114| 2000B|100001|1006| F| 3| 0|
# +-------+-------+------+----+-----+-------+----------+
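As mentioned above, the same set of case-when columns can also be built with a list comprehension instead of map; an equivalent sketch:
# equivalent construction of the case-when columns using a list comprehension
letter_class_casewhens = [func.when(func.col('grade') == grade, func.lit(points))
                          for grade, points in letter_class_dict.items()]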
Upvotes: 1