milva

Reputation: 161

Removing punctuation in spark dataframe

I'm trying to remove punctuation from my tokenized text with regex. I'm using spark dataframes. This is my function:

def removePunctuation(column):
    return trim(lower(regexp_replace(column, '[^\sa-zA-Z0-9]', ''))).alias('stopped')

When I execute this function with:

removed_df.select(removePunctuation(col('stopped'))).show(truncate=False)

I get the error:

Py4JJavaError: An error occurred while calling o736.select.
: org.apache.spark.sql.AnalysisException: cannot resolve 'regexp_replace(`stopped`, '[^\\sa-zA-Z0-9]', '')' due to data type mismatch: argument 1 requires string type, however, '`stopped`' is of array<string> type.;;

Is there any way to remove punctuation by this function? What is wrong with it?

Upvotes: 2

Views: 8449

Answers (2)

Austin Hicks

Reputation: 1

Here is an alternate approach:

from pyspark.sql.functions import translate
from pyspark.ml.feature import Tokenizer

# Delete the listed punctuation characters from the string column.
df = pysparkDF.withColumn(
    'someColOfStrings',
    translate('someColOfStrings', '!"#$%&\'()*+,-./:;<=>?@[\\]^_{|}~', '')
)

# Tokenize (split on whitespace) once the punctuation is gone.
tokenizer = Tokenizer(inputCol="someColOfStrings", outputCol="textTokens")
df1 = tokenizer.transform(df.dropna())

With this code, you tokenize the strings after the punctuation has been removed.
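The per-row effect of Spark's translate function can be sketched in plain Python with str.translate (a minimal sketch of the character-deletion step only, not Spark code; the helper name is illustrative):

```python
# Plain-Python sketch of what translate() does to each row:
# every character listed in `punctuation` is deleted from the string.
punctuation = '!"#$%&\'()*+,-./:;<=>?@[\\]^_{|}~'
table = str.maketrans('', '', punctuation)

def strip_punctuation(s):
    """Remove the listed punctuation characters from a single string."""
    return s.translate(table)

print(strip_punctuation("Hello, world! (test)"))  # -> Hello world test
```

This mirrors the column-level operation row by row; Spark applies the same mapping to every value in the column.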

Upvotes: 0

Paul

Reputation: 1174

The error message says that your column stopped is of type array<string> rather than string. You need a string column for regexp_replace.

To apply it to an array of strings, you can first join the array into a single string, run the replacement, and then split the string back into an array:

from pyspark.sql.functions import concat_ws, lower, regexp_replace, split, trim

def removePunctuation(column):
    # Join the array into one string, strip punctuation, then split it back.
    # Note that lower() also lowercases the separator, so split on its
    # lowercase form; the separator must not occur in the data itself.
    return split(
        trim(lower(regexp_replace(concat_ws('SEPARATORSTRING', column),
                                  r'[^\sa-zA-Z0-9]', ''))),
        'separatorstring'
    ).alias('stopped')
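The join-clean-split trick can be illustrated row by row in plain Python (a sketch of the logic only, not Spark code; it assumes the separator string never occurs inside the tokens):

```python
import re

SEP = "SEPARATORSTRING"  # assumed never to appear inside the tokens

def remove_punctuation(tokens):
    """Row-level sketch of concat_ws -> regexp_replace -> lower/trim -> split."""
    joined = SEP.join(tokens)                        # concat_ws
    cleaned = re.sub(r'[^\sa-zA-Z0-9]', '', joined)  # regexp_replace
    lowered = cleaned.lower().strip()                # lower + trim
    return lowered.split(SEP.lower())                # split on lowercased sep

print(remove_punctuation(["Hello,", "World!"]))  # -> ['hello', 'world']
```

Note the split uses the lowercase separator, because the lowercasing step has already transformed it along with the rest of the string.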

Upvotes: 3
