Remove meaningless words from dataframe column

Question

The dataframe column contains sentences that include a few two- and three-letter words with no meaning. I want to find all such words in the column and then remove them. Sample df:

id      text
1       happy birthday syz
2       vz
3       have a good bne weekend

I want to 1) find all words of length 3 or less (this should return syz, vz, bne) and 2) remove these words. (Note that stopwords have already been removed, so words like "a" and "the" don't exist in the real dataframe column; the dataframe above is just an example.)

I tried the code below, but it doesn't work:

def word_length(text):
    words = []
    for word in text:
        if len(word) <= 3:
            words.append(word)
    return(words)

short_words = df['text'].apply(word_length).sum()

The output should be:

id      text
1       happy birthday 
2       
3       have good weekend

Ivan Sudos · Accepted Answer

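You apply the function as if the column held sequences of words, whereas the actual data is a column of strings (sequences of characters), so the loop iterates over single characters rather than words. You should also remove .sum(): it is not needed when the function returns the cleaned string for each row.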
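To illustrate the point, here is a small sketch using one of the sample strings from the question. The variable names are only for illustration.

```python
text = "happy birthday syz"

# Iterating over a str yields single characters, not words.
chars = [w for w in text]

# Splitting on whitespace first yields the words.
words = text.split()

print(chars[:5])
print(words)
```

Every character has length 1, so a length test inside such a loop never looks at a whole word.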

Rewrite the function you apply in the form:

def filter_short_words(text):
    # split on whitespace, keep words longer than 3 characters, rejoin with spaces
    return " ".join([w for w in text.split() if len(w) > 3])

This works.
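For completeness, a minimal sketch of both steps (finding the short words, then removing them) on the sample data from the question. It assumes pandas is installed and that "short" means three characters or fewer; the names MAX_LEN and find_short_words are hypothetical and not from the original post.

```python
import pandas as pd

MAX_LEN = 3  # assumed threshold: words of this length or shorter are dropped

df = pd.DataFrame({
    "id": [1, 2, 3],
    "text": ["happy birthday syz", "vz", "have a good bne weekend"],
})

def find_short_words(text):
    """Return the words in `text` that are MAX_LEN characters or shorter."""
    return [w for w in text.split() if len(w) <= MAX_LEN]

def filter_short_words(text):
    """Return `text` with the short words removed."""
    return " ".join(w for w in text.split() if len(w) > MAX_LEN)

# Step 1: collect every short word in the column into one flat list.
# explode() turns each list element into its own row; dropna() discards
# the NaN rows produced by empty lists.
short_words = df["text"].apply(find_short_words).explode().dropna().tolist()

# Step 2: rewrite the column without the short words.
df["text"] = df["text"].apply(filter_short_words)

print(short_words)
print(df)
```

On the sample data, "a" also shows up among the short words, because the example still contains it; in the real column the stopwords are said to be removed already.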
