user11035754

Reputation: 227

Remove meaningless words from dataframe column

My dataframe column contains sentences that include a few meaningless two- and three-letter words. I want to find all such words in the column and then remove them. Sample df:

id      text
1       happy birthday syz
2       vz
3       have a good bne weekend 

I want to 1) find all words with a length of 3 or less (this should return syz, vz, bne) and 2) remove these words. (Note that stopwords have already been removed, so words like "a" and "the" no longer exist in the real column; the dataframe above is just an example.)

I tried the code below, but it doesn't work:

def word_length(text):
    words = []
    for word in text:
        if len(word) <= 3:
            words.append(word)
    return(words)

short_words = df['text'].apply(word_length).sum()

The output should be:

id      text
1       happy birthday 
2       
3       have good weekend 

Upvotes: 0

Views: 1355

Answers (1)

Ivan Sudos

Reputation: 1483

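Your function treats each value as a sequence of words, but the column actually holds strings (sequences of characters), so the loop iterates over single characters and every one of them passes the length check. Split each string into words first. You should also drop .sum(); it is not needed when the function returns the cleaned string for each row.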

Rewrite the function you apply in the form:

def filter_short_words(text):
    # keep only words longer than 3 characters, rejoined with single spaces
    return " ".join([w for w in text.split() if len(w) > 3])

Apply it with df['text'] = df['text'].apply(filter_short_words) to overwrite the column with the cleaned text.
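For reference, here is a minimal end-to-end sketch covering both parts of the question (listing the short words, then removing them). It assumes the sample frame from the question and a cutoff of 3 characters; MAX_LEN and find_short_words are names I made up for illustration, not anything from pandas.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "text": ["happy birthday syz", "vz", "have a good bne weekend"],
})

MAX_LEN = 3  # assumed cutoff: words with this many characters or fewer are dropped


def find_short_words(text, max_len=MAX_LEN):
    """Return the words in `text` that have at most `max_len` characters."""
    return [w for w in text.split() if len(w) <= max_len]


def filter_short_words(text, max_len=MAX_LEN):
    """Return `text` with words of at most `max_len` characters removed."""
    return " ".join(w for w in text.split() if len(w) > max_len)


# 1) Collect every short word across the column into one flat list.
short_words = [w for words in df["text"].apply(find_short_words) for w in words]

# 2) Overwrite the column with the cleaned text.
df["text"] = df["text"].apply(filter_short_words)

print(short_words)
print(df)
```

Keep in mind that a pure length check also removes real short words ("day", "the", and the "a" in the sample row), so it is only reasonable if, as in the question, the meaningful short words are already gone or do not matter.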

Upvotes: 1
