hansonmbbop

Reputation: 83

How to remove less frequent words from pandas dataframe

How do I remove words that appear fewer than x times (for example, fewer than 3 times) in a pandas DataFrame? I use nltk to remove non-English words, but the result is not good. I assume that words appearing fewer than 3 times are non-English words.

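import nltk
from nltk.corpus import words as nltk_words  # assumes nltk.download('words') has been run
# Assumption: `words` is the NLTK English vocabulary; the original post does not show its definition.
words = set(w.lower() for w in nltk_words.words())
input_text=["this is th text one tctst","this is text two asdf","this text will be remove"]
def clean_non_english(text):
    # Keep tokens that are known English words, plus any non-alphabetic tokens.
    text=" ".join(w for w in nltk.wordpunct_tokenize(text) if w.lower() in words or not w.isalpha())
    return text
Dataset['text']=Dataset['text'].apply(clean_non_english)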

Desired output

input_text=["this is text ","this is text ","this is text"]

So any word that appears fewer than 3 times across the whole list should be removed.

Upvotes: 1

Views: 1974

Answers (1)

Ayoub ZAROU

Reputation: 2417

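Try this: count every word across all the sentences with np.unique, then drop the words that occur fewer than 3 times.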

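import numpy as np

input_text=["this is th text one tctst","this is text two asdf","this text will be remove"]
# Flatten all sentences into one list of words.
all_ = [x for y in input_text for x in y.split(' ')]
# Unique words and how often each one occurs.
a, b = np.unique(all_, return_counts=True)
to_remove = a[b < 3]  # words seen fewer than 3 times
# Rebuild each sentence without the rare words.
output_text = [' '.join(np.array(y.split(' '))[~np.isin(y.split(' '), to_remove)])
                for y in input_text]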
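
For comparison, here is a minimal sketch of the same idea using only the standard library. The function name `remove_rare_words` and its `min_count` parameter are made up for illustration and are not part of the original answer; the sketch also assumes that words are separated by whitespace and that matching is case-sensitive.

```python
from collections import Counter


def remove_rare_words(texts, min_count=3):
    """Drop every word that occurs fewer than min_count times across all texts.

    Hypothetical helper for illustration; splits on whitespace, case-sensitive.
    """
    texts = list(texts)
    # Count each whitespace-separated word over the whole collection.
    counts = Counter(word for text in texts for word in text.split())
    # Rebuild each text from the words that are frequent enough.
    return [" ".join(w for w in text.split() if counts[w] >= min_count)
            for text in texts]


input_text = ["this is th text one tctst",
              "this is text two asdf",
              "this text will be remove"]
output_text = remove_rare_words(input_text, min_count=3)
print(output_text)  # ['this text', 'this text', 'this text']
```

Note that "is" occurs only twice in the sample, so a threshold of 3 removes it as well, which differs from the desired output shown in the question. With `min_count=2` the result is `['this is text', 'this is text', 'this text']`. Since the function accepts any iterable of strings, something like `Dataset['text'] = remove_rare_words(Dataset['text'])` should work for the DataFrame column in the question.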

Upvotes: 4
