Smashett

Reputation: 13

Pickling Error while using Stopwords from NLTK in pyspark (databricks)

I found the following function online:

def RemoveStops(data_str):    
    #nltk.download('stopwords')
    english_stopwords = stopwords.words("english")
    broadcast(english_stopwords)
    # expects a string
    stops = set(english_stopwords)
    list_pos = 0
    cleaned_str = ''
    text = data_str.split()
    for word in text:
        if word not in stops:
            # rebuild cleaned_str
            if list_pos == 0:
                cleaned_str = word
            else:
                cleaned_str = cleaned_str + ' ' + word
            list_pos += 1
    return cleaned_str

and then I am doing the following:

ColumntoClean = udf(lambda x: RemoveStops(x), StringType())
data = data.withColumn("CleanedText", ColumntoClean(data[TextColumn]))

The error I am getting is the following:

PicklingError: args[0] from newobj args has the wrong class

Funny thing is if I rerun the same set of code, it runs and throws no pickling error. Can someone help me resolve this issue? Thank you!

Upvotes: 1

Views: 703

Answers (1)

Maddy

Reputation: 146

Just change your function this way and it should run.

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
english_stopwords = stopwords.words("english")
def RemoveStops(data_str):    
    # expects a string
    stops = set(english_stopwords)
    list_pos = 0
    cleaned_str = ''
    text = data_str.split()
    for word in text:
        if word not in stops:
            # rebuild cleaned_str
            if list_pos == 0:
                cleaned_str = word
            else:
                cleaned_str = cleaned_str + ' ' + word
            list_pos += 1
    return cleaned_str

Databricks is a pain when it comes to nltk. It doesn't allow stopwords.words("english") to run inside a function that is applied as a UDF: the call has to happen once on the driver, outside the function, so that only the resulting plain list of words is captured in the UDF's closure and pickled to the executors.
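The cleaning logic itself can be exercised without Spark. Below is a minimal sketch of the same pattern; it uses a small hard-coded stopword list as a stand-in for `stopwords.words("english")` (assumed here only so the snippet runs without NLTK), loaded once at module level so a UDF closing over it pickles cleanly:

```python
# Stand-in for stopwords.words("english"), defined at module level
# (outside the function), mirroring the fix in the answer above.
english_stopwords = ["a", "an", "the", "is", "in", "of"]

def remove_stops(data_str):
    """Drop stopwords from a whitespace-separated string."""
    stops = set(english_stopwords)  # set membership is O(1) per word
    return ' '.join(word for word in data_str.split() if word not in stops)

print(remove_stops("the cat is in the hat"))  # -> "cat hat"
```

Because `remove_stops` only references a plain list through its closure, wrapping it with `udf(remove_stops, StringType())` serializes without touching NLTK's corpus reader.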

Upvotes: 1
