Reputation: 161
I'm trying to remove punctuation from my tokenized text with a regex. I'm using Spark DataFrames. This is my function:
def removePunctuation(column):
    return trim(lower(regexp_replace(column, '[^\sa-zA-Z0-9]', ''))).alias('stopped')
When I execute this function with:
removed_df.select(removePunctuation(col('stopped'))).show(truncate=False)
I get the error:
Py4JJavaError: An error occurred while calling o736.select.
: org.apache.spark.sql.AnalysisException: cannot resolve 'regexp_replace(`stopped`, '[^\\sa-zA-Z0-9]', '')' due to data type mismatch: argument 1 requires string type, however, '`stopped`' is of array<string> type.;;
Is there any way to remove punctuation with this function? What is wrong with it?
Upvotes: 2
Views: 8449
Reputation: 1
Here is an alternate approach:
from pyspark.sql.functions import translate
from pyspark.ml.feature import Tokenizer

df = pysparkDF.withColumn('someColOfStrings', translate('someColOfStrings', '!"#$%&\'()*+,-./:;<=>?@[\\]^_{|}~', ''))
tokenizer = Tokenizer(inputCol='someColOfStrings', outputCol='textTokens')
df1 = tokenizer.transform(df.dropna())
With this code, you tokenize the strings after removing the punctuation.
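For reference, here is a pure-Python sketch of what that pipeline does: str.translate drops the listed punctuation characters, and Spark's Tokenizer then lowercases and splits on whitespace (the sample sentence is made up for illustration):

```python
punct = '!"#$%&\'()*+,-./:;<=>?@[\\]^_{|}~'
table = str.maketrans('', '', punct)  # map each punctuation char to None

s = "Hello, world! It's fine."
cleaned = s.translate(table)          # drop the punctuation
tokens = cleaned.lower().split()      # like Tokenizer: lowercase + whitespace split
print(tokens)  # → ['hello', 'world', 'its', 'fine']
```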
Upvotes: 0
Reputation: 1174
The error message says that your column stopped is of type array<string> rather than string. regexp_replace requires a string column. To apply it to an array of strings, you can first join the array into a single string and then split that string again:
# use a lowercase separator: lower() runs before split, so an uppercase
# separator would no longer match the split pattern
def removePunctuation(column):
    return split(trim(lower(regexp_replace(concat_ws('separatorstring', column), '[^\sa-zA-Z0-9]', ''))), 'separatorstring').alias('stopped')
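A pure-Python sketch of the same join → clean → split trick, with re.sub standing in for regexp_replace (the separator string and sample tokens here are made up for illustration):

```python
import re

# Any marker that survives the punctuation regex and is already lowercase
SEP = 'separatorstring'

def remove_punctuation(tokens):
    joined = SEP.join(tokens)                        # array -> one string
    cleaned = re.sub(r'[^\sa-zA-Z0-9]', '', joined)  # strip punctuation
    return cleaned.lower().strip().split(SEP)        # string -> array again

print(remove_punctuation(["Hello,", "world!", "it's", "fine."]))
# → ['hello', 'world', 'its', 'fine']
```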
Upvotes: 3