user3665906

Reputation: 195

Remove different meaningless tokens from text in Python

I am new to topic modeling. After tokenizing my text with NLTK, I end up with tokens such as the following:

'1-in', '1-joerg', '1-justine', '1-lleyton', '1-million', '1-nil', '1of', '00pm-ish', '01.41', '01.57', '0-40', '0-40f',

I believe these tokens are meaningless and cannot help me in the rest of my process. Is that correct? If so, does anyone have an idea for a regular expression (or some other approach) that I could use to remove them from my token list? They are so varied that I could not come up with a regexp for this purpose.

Upvotes: 1

Views: 2348

Answers (1)

john smith

Reputation: 50

I've found the easiest way to get rid of words I don't want in a string is to replace them with an empty string using the re module.

import re

def word_replace(text, replace_dict):
    # Match whole words that start with a letter or underscore.
    rc = re.compile(r"[A-Za-z_]\w*")

    def translate(match):
        # Look the word up in lowercase; words not in the dict are kept unchanged.
        word = match.group(0)
        return replace_dict.get(word.lower(), word)

    return rc.sub(translate, text)

# Read the whole file and close it before it is reopened for writing.
with open('C:/the_file_with_this_string') as f:
    old_text = f.read()

# Each unwanted word maps to an empty string, which deletes it.
replace_dict = {
    "unwanted_string1": '',
    "unwanted_string2": '',
    "unwanted_string3": '',
    "unwanted_string4": '',
    "unwanted_string5": '',
    "unwanted_string6": '',
    "unwanted_string7": '',
    "unwanted_string8": '',
    "unwanted_string9": '',
    "unwanted_string10": ''
}

output = word_replace(old_text, replace_dict)

# Overwrite the original file with the cleaned text.
with open('C:/the_file_with_this_string', 'w') as f:
    f.write(output)
print(output)

Replace 'C:/the_file_with_this_string' with the path to the file that contains the string.

Replace each unwanted_string(#) with a string you want to get rid of, written in lowercase, because each word is lowercased before it is looked up.
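Note that the pattern in the code above only matches words that begin with a letter or underscore, so a token such as '1-in' is never looked up as a whole. For a token list like the one in the question, a possible alternative is to keep only the tokens that look like ordinary words instead of listing every unwanted one. The following is a minimal sketch under one assumption that may not suit every corpus: a token is kept only if it consists of letters, optionally joined by single hyphens or apostrophes. The names WORD_RE and keep_wordlike are hypothetical, and 'tennis' and 'well-known' are made-up tokens added for illustration.

```python
import re

# Assumption: a "meaningful" token is made of letters only, optionally
# joined by single hyphens or apostrophes (e.g. "well-known", "don't").
WORD_RE = re.compile(r"[A-Za-z]+(?:[-'][A-Za-z]+)*")

def keep_wordlike(tokens):
    """Return only the tokens that match WORD_RE in full."""
    return [t for t in tokens if WORD_RE.fullmatch(t)]

# Tokens from the question, plus two illustrative word-like tokens.
tokens = ['1-in', '1-joerg', '1-justine', '1-lleyton', '1-million',
          '1-nil', '1of', '00pm-ish', '01.41', '01.57', '0-40', '0-40f',
          'tennis', 'well-known']

print(keep_wordlike(tokens))  # ['tennis', 'well-known']
```

Whether dropping every token that contains a digit is appropriate depends on the corpus; if numbers such as years or scores matter for your topics, the pattern would need to be loosened.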

Upvotes: 1
