Reputation: 174
I want to remove meaningless or invalid data from cells (values that are a meaningless combination of letters, or that contain only numbers).
My data is below.
ID  A1          B1    C1
1   apple       adfs  banana
2   I love you  mom   111
3   zaaaaf      dad   348080
The expected output is below.
ID  A1          B1    C1
1   apple             banana
2   I love you  mom
3               dad
How can I do this?
Upvotes: 0
Views: 2357
Reputation: 862406
You can compare the values against a dictionary, here the word list from nltk,
and remove the values that do not match. Some valid values can still be removed, like mom,
if they do not exist in the nltk dictionary:
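import nltk
# the word list has to be downloaded once: nltk.download('words')
words = set(nltk.corpus.words.words())
# keep only the tokens found in the word list
# https://stackoverflow.com/a/41290205
f = lambda x: " ".join(w for w in nltk.wordpunct_tokenize(x) if w.lower() in words)
# apply only to object columns (i.e. strings)
cols = df.select_dtypes(object).columns
# in pandas >= 2.1, DataFrame.map is the non-deprecated name for applymap
df[cols] = df[cols].applymap(f)
print(df)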
   ID          A1   B1      C1
0   1       apple       banana
1   2  I love you
2   3              dad
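If valid words such as mom get dropped because they are missing from the dictionary, one possible workaround is to extend the dictionary with your own allow-list. The following is only a minimal sketch of that idea: the names make_word_filter, vocabulary and extra_words are hypothetical, the tiny vocabulary is a stand-in for set(nltk.corpus.words.words()), and a simple regex split is used instead of nltk.wordpunct_tokenize so that the example runs without nltk.

```python
import re

def make_word_filter(vocabulary, extra_words=()):
    """Return a function that keeps only tokens found in the allowed word set."""
    # Build the lowercase lookup set once, not on every call.
    allowed = {w.lower() for w in vocabulary} | {w.lower() for w in extra_words}

    def keep_known(text):
        # Split into runs of word characters (a simplification of wordpunct_tokenize).
        tokens = re.findall(r"\w+", str(text))
        return " ".join(t for t in tokens if t.lower() in allowed)

    return keep_known

# Stand-in vocabulary (assumption); with nltk, pass set(nltk.corpus.words.words()).
vocabulary = {"apple", "banana", "i", "love", "you", "dad"}

strict = make_word_filter(vocabulary)
lenient = make_word_filter(vocabulary, extra_words={"mom"})

print(repr(strict("mom")))         # ''
print(repr(lenient("mom")))        # 'mom'
print(repr(lenient("I love you"))) # 'I love you'
print(repr(lenient("348080")))     # ''
```

The function returned by make_word_filter could then be passed to applymap in place of f. Building the allowed set once matters with the full nltk word list, which is large.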
Upvotes: 2