Reputation: 13
I found the following function online:
def RemoveStops(data_str):
    #nltk.download('stopwords')
    english_stopwords = stopwords.words("english")
    broadcast(english_stopwords)
    # expects a string
    stops = set(english_stopwords)
    list_pos = 0
    cleaned_str = ''
    text = data_str.split()
    for word in text:
        if word not in stops:
            # rebuild cleaned_str
            if list_pos == 0:
                cleaned_str = word
            else:
                cleaned_str = cleaned_str + ' ' + word
            list_pos += 1
    return cleaned_str
and then I am doing the following:
ColumntoClean = udf(lambda x: RemoveStops(x), StringType())
data = data.withColumn("CleanedText", ColumntoClean(data[TextColumn]))
The error I am getting is the following:
PicklingError: args[0] from newobj args has the wrong class
Oddly, if I rerun the same code, it runs without throwing the pickling error. Can someone help me resolve this issue? Thank you!
Upvotes: 1
Views: 703
Reputation: 146
Move the stopword loading out of the function, like this, and it should run:
import nltk
from nltk.corpus import stopwords

# Load the stopword list once, at module level, instead of inside the UDF.
nltk.download('stopwords')
english_stopwords = stopwords.words("english")

def RemoveStops(data_str):
    # expects a string
    stops = set(english_stopwords)
    list_pos = 0
    cleaned_str = ''
    text = data_str.split()
    for word in text:
        if word not in stops:
            # rebuild cleaned_str
            if list_pos == 0:
                cleaned_str = word
            else:
                cleaned_str = cleaned_str + ' ' + word
            list_pos += 1
    return cleaned_str
Databricks is a pain when it comes to nltk. It doesn't allow stopwords.words("english") to run inside a function that is applied as a UDF.
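If it helps, here is a minimal sketch of the same idea in plain Python: build the stop set once, outside the per-row function, and let the function just use it. The names make_remove_stops and sample_stopwords are made up for this illustration, and the short word list is only a stand-in for stopwords.words("english") so the snippet runs without nltk or Spark.

```python
from typing import Callable, Iterable


def make_remove_stops(stop_words: Iterable[str]) -> Callable[[str], str]:
    """Return a function that strips the given stop words from a string."""
    # Build the set once, outside the per-row function, so the returned
    # function only captures a plain Python object.
    stops = frozenset(stop_words)

    def remove_stops(data_str: str) -> str:
        # Keep the words that are not stop words and rejoin them with
        # single spaces (same result as the loop in the answer above).
        return " ".join(word for word in data_str.split() if word not in stops)

    return remove_stops


# Stand-in list for illustration only (an assumption); in the answer above
# this would be english_stopwords = stopwords.words("english").
sample_stopwords = ["the", "is", "a", "of"]

remove_stops = make_remove_stops(sample_stopwords)
cleaned = remove_stops("the cat is a friend of the dog")
print(cleaned)
```

In the Spark job you would presumably wrap it the same way as in the question, e.g. udf(make_remove_stops(english_stopwords), StringType()). I have not tested that on Databricks, so treat it as a sketch rather than a confirmed fix.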
Upvotes: 1