carlofacose

Reputation: 35

Problem with UDF in Spark - TypeError: 'Column' object is not callable

Hello everyone!
I have a dataframe with 2,510,765 rows containing application reviews and their corresponding scores, with the following structure:

root
 |-- content: string (nullable = true)
 |-- score: string (nullable = true)

I wrote these two functions to remove punctuation and emojis from the text:

import string

def remove_punct(text):
    return text.translate(str.maketrans('', '', string.punctuation))

and

import re

def removeEmoji(text):
    regrex_pattern = re.compile(pattern = "["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           "]+", flags = re.UNICODE)
    return regrex_pattern.sub(r'',text)

I use the udf function to turn the punctuation- and emoji-removal functions defined above into Spark functions:

from pyspark.sql.functions import udf

punct_remove = udf(lambda s: remove_punct(s))

removeEmoji = udf(lambda s: removeEmoji(s))

But I get the following error:

TypeError                                 Traceback (most recent call last)

<ipython-input-29-e5d42d609b59> in <module>()
----> 1 new_df = new_df.withColumn("content", remove_punct(df_merge["content"]))
      2 new_df.show(5)

<ipython-input-21-dee888ef5b90> in remove_punct(text)
      2 
      3 def remove_punct(text):
----> 4     return text.translate(str.maketrans('', '', string.punctuation))
      5 
      6 

TypeError: 'Column' object is not callable

How can this be solved? Is there another way to run user-written functions on the dataframe?
Thank you ;)

Upvotes: 1

Views: 1581

Answers (1)

werner

Reputation: 14905

The stack trace suggests that you are calling the plain Python function directly, not the udf.

remove_punct is a plain-vanilla Python function, while punct_remove is a udf that can be used as the second argument of the withColumn call.
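To see why the error message mentions a Column: the plain function works fine on a str, but pyspark's Column resolves unknown attribute access (such as .translate) to another Column expression rather than a bound str method, and calling that Column raises the TypeError. The same failure can be sketched without Spark; the ColumnLike class below is only an illustrative stand-in, not pyspark's actual Column:

```python
import string

def remove_punct(text):
    # Works as intended when text is a plain str.
    return text.translate(str.maketrans('', '', string.punctuation))

print(remove_punct("Hello, world!"))  # Hello world

class ColumnLike:
    # Mimics how a pyspark Column turns attribute access into a new
    # Column expression instead of a str method.
    def __getattr__(self, name):
        return ColumnLike()

try:
    # text.translate is now a ColumnLike instance, which is not callable.
    remove_punct(ColumnLike())
except TypeError as exc:
    print(exc)  # 'ColumnLike' object is not callable
```

This is exactly what happens inside the withColumn call when the plain function receives a Column instead of a str.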

One way to solve the problem is to use punct_remove instead of remove_punct in the withColumn call:

new_df = new_df.withColumn("content", punct_remove(df_merge["content"]))

Another way to reduce the chance of mixing up the Python function with the udf is to use the @udf decorator:

import string

from pyspark.sql import functions as F
from pyspark.sql import types as T

@F.udf(returnType=T.StringType())
def remove_punct(text):
    return text.translate(str.maketrans('', '', string.punctuation))

df.withColumn("content", remove_punct(F.col("content"))).show()
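Note also that the line removeEmoji = udf(lambda s: removeEmoji(s)) in the question rebinds the name removeEmoji, so inside the lambda removeEmoji now refers to the udf itself and the call would recurse instead of invoking the original Python function. The same rebinding issue can be reproduced without Spark; shout is just an illustrative name, not from the question:

```python
def shout(text):
    return text.upper()

# Rebinding the same name: the lambda's body looks up "shout" at call
# time, and the name now refers to the lambda itself.
shout = lambda s: shout(s)

try:
    shout("hi")
except RecursionError:
    print("the lambda recursed into itself")

# Binding the wrapper to a *different* name avoids the problem.
def shout_orig(text):
    return text.upper()

shout_wrapped = lambda s: shout_orig(s)
print(shout_wrapped("hi"))  # HI
```

The @udf decorator sidesteps this pitfall as well, since the decorated function never calls itself by name.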

Upvotes: 1
