purplecollar
purplecollar

Reputation: 174

How to find and remove invalid / meaningless text in python?

I want to remove meaningless or invalid data on cell. (A combination of meaningless alphabets or only numbers in cells)

My data is below.

ID         A1           B1          C1
1          apple        adfs        banana
2          I love you   mom         111
3          zaaaaf       dad         348080

Expected output as below.

ID         A1           B1          C1
1          apple                    banana
2          I love you   mom         
3                       dad          

How can I this?

Upvotes: 0

Views: 2357

Answers (1)

jezrael
jezrael

Reputation: 862406

You can compare values to some dictonary, here from ntlk and if not match remove values, but still is possible some values are removed like mom if not exist in dictionary ntlk:

import nltk
words = set(nltk.corpus.words.words())

#https://stackoverflow.com/a/41290205
f = lambda x: " ".join(w for w in nltk.wordpunct_tokenize(x) if w.lower() in words)

#apply only for object columns (obviously strings)
cols = df.select_dtypes(object).columns
df[cols] = df[cols].applymap(f)
print (df)
   ID          A1   B1      C1
0   1       apple       banana
1   2  I love you             
2   3              dad        

Upvotes: 2

Related Questions