I have a question about preprocessing my text corpus. I want to remove every non-alphabetic character (punctuation and digits) from the text and split it into words at those positions. I have tried a few approaches, but none of them quite solves the problem.
E.g. given the sentence:
A B C D ,5 .. AAA55AAA aaa.bbb.ccc
As a result I want to get:
'A' 'B' 'C' 'D' 'AAA' 'AAA' 'aaa' 'bbb' 'ccc'
I've tried NLTK:
from nltk.tokenize import word_tokenize
tokens = word_tokenize(my_sentence)
and then filtered the tokens with isalpha():
words = [word for word in tokens if word.isalpha()]
The result is:
'A', 'B', 'C', 'D'
So it doesn't solve my problem: it drops every token that contains a non-alphabetic character instead of stripping just those characters.
I also tried removing punctuation with str.translate:
import string
table = str.maketrans('', '', string.punctuation)
stripped = [w.translate(table) for w in tokens]
but it removes only punctuation, so digits stay and tokens that were all punctuation become empty strings:
'A', 'B', 'C', 'D', '5', '', 'AAA55AAA'
Is there a solution using NLTK or something else? Or is the only way to apply a regex to each word? (I would really rather not do that, because running a regex per word takes a long time, especially on a huge file.)
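For reference, here is a minimal sketch of the regex route applied once to the whole string rather than word by word, using only the standard `re` module. It assumes "letters" means ASCII A-Z and a-z; the names `LETTER_RUN` and `letter_tokens` are made up for illustration. I have not benchmarked it on a large file.

```python
import re

# Assumption: only ASCII letters count as "alphabetic".
# For Unicode letters, the pattern r"[^\W\d_]+" could be used instead.
LETTER_RUN = re.compile(r"[A-Za-z]+")  # compiled once, reused for every call


def letter_tokens(text):
    """Return every maximal run of ASCII letters in text, in order."""
    return LETTER_RUN.findall(text)


my_sentence = "A B C D ,5 .. AAA55AAA aaa.bbb.ccc"
print(letter_tokens(my_sentence))
# ['A', 'B', 'C', 'D', 'AAA', 'AAA', 'aaa', 'bbb', 'ccc']
```

This gives the output I am after for the example sentence, since digits and punctuation both act as separators. On a huge file it could be called once per line or per chunk instead of once per token.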
Upvotes: 0
Views: 499