Reputation: 35
Hello everyone!
I have a dataframe with 2510765 rows containing application reviews and their scores, with the following structure:
root
|-- content: string (nullable = true)
|-- score: string (nullable = true)
I wrote these two functions to remove punctuation and emojis from text:
import string

def remove_punct(text):
    return text.translate(str.maketrans('', '', string.punctuation))
and
import re

def removeEmoji(text):
    regrex_pattern = re.compile(pattern = "["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        "]+", flags = re.UNICODE)
    return regrex_pattern.sub(r'', text)
I use the udf function to create Spark functions from the ones I defined for removing punctuation and emojis:
from pyspark.sql.functions import udf
punct_remove = udf(lambda s: remove_punct(s))
emoji_remove = udf(lambda s: removeEmoji(s))  # distinct name, so the plain removeEmoji function is not overwritten
But I get the following error:
TypeError Traceback (most recent call last)
<ipython-input-29-e5d42d609b59> in <module>()
----> 1 new_df = new_df.withColumn("content", remove_punct(df_merge["content"]))
2 new_df.show(5)
<ipython-input-21-dee888ef5b90> in remove_punct(text)
2
3 def remove_punct(text):
----> 4 return text.translate(str.maketrans('', '', string.punctuation))
5
6
TypeError: 'Column' object is not callable
How can it be solved? Is there another way to make user-written functions run on the dataframe?
Thank you ;)
Upvotes: 1
Views: 1581
Reputation: 14905
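The stack trace suggests that you are calling the plain Python function directly, not the udf. remove_punct is a plain Python function, while punct_remove is a udf that can be used as the second parameter of the withColumn call. When remove_punct receives a Column instead of a string, text.translate resolves to another Column, and calling that Column raises the TypeError.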
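To see why the message mentions a Column rather than a missing attribute, here is a toy sketch that runs without Spark. FakeColumn and describe_call are hypothetical names made up for this illustration, not PySpark classes; the sketch only assumes that attribute access on a Column yields another column expression instead of raising AttributeError.

```python
import string


class FakeColumn:
    """Toy stand-in for a Spark Column (hypothetical, for illustration only)."""

    def __init__(self, name):
        self.name = name

    def __getattr__(self, item):
        # Assumption: like a Spark Column, an unknown attribute is treated as
        # a field lookup and yields another column expression.
        if item.startswith("__"):
            raise AttributeError(item)
        return FakeColumn(f"{self.name}.{item}")


def remove_punct(text):
    # Same body as the plain Python function from the question
    return text.translate(str.maketrans("", "", string.punctuation))


def describe_call(value):
    """Call remove_punct and return either its result or the error text."""
    try:
        return remove_punct(value)
    except TypeError as exc:
        return f"TypeError: {exc}"
```

With a real string the function behaves normally; with the column stand-in, text.translate is itself a column-like object, so the call fails in the same way as in the traceback.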
One way to solve the problem is to use punct_remove instead of remove_punct in the withColumn call.
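A minimal sketch of that first option, keeping the plain functions and the udfs under clearly different names. The names remove_emoji, emoji_remove, clean_content, _PUNCT_TABLE and _EMOJI_PATTERN are my own choices for the example, and the None checks are an addition that assumes the nullable content column can actually contain nulls.

```python
import re
import string

# Built once at import time instead of on every call
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)
_EMOJI_PATTERN = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # flags (iOS)
    "]+"
)


def remove_punct(text):
    # Plain Python function: works on str, passes None through unchanged
    if text is None:
        return None
    return text.translate(_PUNCT_TABLE)


def remove_emoji(text):
    # Plain Python function: works on str, passes None through unchanged
    if text is None:
        return None
    return _EMOJI_PATTERN.sub("", text)


def clean_content(df):
    """Apply both cleaners to the `content` column of a Spark DataFrame."""
    # Imported here so the plain functions above can be tested without Spark
    from pyspark.sql import functions as F
    from pyspark.sql import types as T

    # The udfs get their own names, so they cannot be confused with
    # the plain functions they wrap
    punct_remove = F.udf(remove_punct, T.StringType())
    emoji_remove = F.udf(remove_emoji, T.StringType())
    return df.withColumn("content", emoji_remove(punct_remove(F.col("content"))))
```

With the dataframe from the question this would be called as new_df = clean_content(df_merge), followed by new_df.show(5). The plain functions can still be checked on ordinary strings without starting a Spark session.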
Another way, which reduces the chance of mixing up the Python function with the udf, is to use the @udf decorator:
import string

from pyspark.sql import functions as F
from pyspark.sql import types as T

# The decorator replaces the plain function with a udf of the same name
@F.udf(returnType=T.StringType())
def remove_punct(text):
    return text.translate(str.maketrans('', '', string.punctuation))

df.withColumn("content", remove_punct(F.col("content"))).show()
Upvotes: 1