lserlohn
lserlohn

Reputation: 6216

error when passing broadcast variable into UDF, Pyspark

I have a function, which trys to pass a broadcast variable into UDF.

The function looks like:

def generate_lookup_code(self, lookup_map):

    lookup_map_broadcast = spark_session.sparkContext.broadcast(lookup_map)
    print("lookup_map has been broadcasted")

    #### UDF function only return a constant string###
    def _generate_code(bc_reasoncode_lookup_map):

        reasoncode_lookup_map = bc_reasoncode_lookup_map.value
        return "hello"


    udfGenerateCode = F.udf(_generate_code, StringType())

    input_df = input_df.withColumn('code', udfGenerateCode(lookup_map_broadcast))

    input_df.show()

My intention is only trying to pass the broadcast variable to the UDF, however, I got the error:

'Broadcast' object has no attribute '_get_object_id'

I have no idea where is wrong?

Upvotes: 1

Views: 6287

Answers (1)

Sergey Khudyakov
Sergey Khudyakov

Reputation: 1182

You don't need to pass a broadcasted variable as a UDF argument, just reference it from the function:

lookup_map_broadcast = spark_session.sparkContext.broadcast(lookup_map)

def _generate_code():
    reasoncode_lookup_map = lookup_map_broadcast.value
    return "hello"

udfGenerateCode = F.udf(_generate_code, StringType())
input_df = input_df.withColumn('code', udfGenerateCode())

A UDF is called for each row and it can accept either a column or literal.

Upvotes: 4

Related Questions