Python. NLP. Preprocessing text

Question

I have a question about preprocessing my text corpus. I want to remove all non-alphabetic characters from the text. I have tried a few approaches, but none of them quite solves the problem.

E.g. I've got a sentence:

A B C D ,5 .. AAA55AAA aaa.bbb.ccc

As a result I want to get:

'A' 'B' 'C' 'D' 'AAA' 'AAA' 'aaa' 'bbb' 'ccc'

I've tried NLTK:

from nltk.tokenize import word_tokenize
my_sentence = "A B C D ,5 .. AAA55AAA aaa.bbb.ccc"
tokens = word_tokenize(my_sentence)

and then filtered the tokens with isalpha():

words = [word for word in tokens if word.isalpha()]

As a result it will be:

'A', 'B', 'C', 'D'

So it doesn't solve my problem: it drops every token that contains a non-alphabetic character.

Another attempt:

import string
table = str.maketrans('', '', string.punctuation)
stripped = [w.translate(table) for w in tokens]

but it removes only punctuation, leaving digits in place and turning punctuation-only tokens into empty strings:

'A', 'B', 'C', 'D', '5', '', 'AAA55AAA'

Is there a solution using NLTK or something else? Or is the only way to run a regex on each word? (I'd really rather not, because I expect regex to be slow, especially on a huge file.)

olinox14 · Accepted Answer

Could you use a regex?

import re
rx = re.compile(r'[^a-zA-Z]')

res = rx.sub(" ", "AAA BB2BB")

print(res)  # >> AAA BB BB

What it does: [^a-zA-Z] matches any character that is not an ASCII letter, and sub() replaces each match with a space.
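To get the list of tokens from the question, one possible sketch is to collect the runs of letters directly instead of substituting and then splitting. The names `LETTER_RUN` and `alpha_tokens` are made up for this example, and the pattern only covers ASCII letters, like the one above. It runs once over the whole string rather than once per word, which may ease the speed concern, though that is worth measuring on your own data.

```python
import re

# Precompiled pattern: one or more consecutive ASCII letters.
# (LETTER_RUN and alpha_tokens are hypothetical names for this sketch.)
LETTER_RUN = re.compile(r"[A-Za-z]+")


def alpha_tokens(text):
    """Return every maximal run of ASCII letters in text, in order."""
    return LETTER_RUN.findall(text)


sentence = "A B C D ,5 .. AAA55AAA aaa.bbb.ccc"
print(alpha_tokens(sentence))
# ['A', 'B', 'C', 'D', 'AAA', 'AAA', 'aaa', 'bbb', 'ccc']

# Same tokens via the substitution shown above, followed by a whitespace split.
print(re.sub(r"[^a-zA-Z]", " ", sentence).split())
```

Both lines print the same list; `findall` just skips the intermediate string.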
