Reputation: 313
I wrote the following code, which works fine on my computer but returns empty results on other computers. Could you please help me solve this problem?
import string
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
def preprocess(sentence):
    sentence = sentence.lower()
    specialChrs={'\xc2',''}
    pattern = r'''(?x) # set flag to allow verbose regexps
    ([A-Z]\.)+ # abbreviations, e.g. U.S.A.
    | \$?\d+%?
    | \$?\d+(,|.\d+)*
    | \w+([-'/]\w+)* # words w/ optional internal hyphens/apostrophe
    |/\m+([-'/]\w+)*
    '''
    tokenizer = RegexpTokenizer(pattern)
    tokens = tokenizer.tokenize(sentence)
    print tokens
    realToken= [e for e in tokens if len(e)>= 3 and len(e)<10]
    stopWords = set(stopwords.words('english'))
    stop_words = [w for w in realToken if not w in stopWords]
    filtered_words = [w for w in stop_words if not w in specialChrs]
    print filtered_words
    # final_words = [w for w in filtered_words if not w[0]=='0' and w[1]=='x']
    return filtered_words
str='I have one generalized rule, where in shellscript I check for all need packages, if any package does not exist, then install it other wise skip to next check. As I need to check and execute few other python as well shellscripts, I am using it. Is using shellscript for this is bad idea?'
preprocess(str)
This is part of the output on my computer:
['i', 'have', 'one', 'generalized', 'rule', 'where', 'in', 'shellscript', 'i', 'check', 'for', 'all', 'need', .......'idea']
Results on other computers:
[('', '', '', ''), ('', '', '', ''), ('', '', '', ''), ('', '', '', ''), ('', '', '', ''), ('', '', '', ''), ('', '', '', ''),... ]
My computer's information:
python 2.7.12 |Anaconda 2.3.0 (64-bit)| (default, Jul 2 2016, 17:42:40) [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux2 Type "help", "copyright", "credits" or "license" for more information. Anaconda is brought to you by Continuum Analytics. Please check out: http://continuum.io/thanks and https://anaconda.org
import nltk
print('The nltk version is {}.'.format(nltk.__version__))
The nltk version is 3.2.1.
My friend's computer:
python 2.7.12 |Anaconda 4.1.1 (64-bit)| (default, Jun 29 2016, 11:42:40) [MSC v.1500 64 bit (AMD64)] on win32 Type "help", "copyright", "credits" or "license" for more information. Anaconda is brought to you by Continuum Analytics. Please check out: http://continuum.io/thanks and https://anaconda.org
import nltk
print('The nltk version is {}.'.format(nltk.__version__))
The nltk version is 3.2.1.
I also tested my code on another computer and got the same result.
The information for that computer is:
Python 2.7.3 (default, Oct 26 2016, 21:01:49) [GCC 4.6.3] on linux2 Type "help", "copyright", "credits" or "license" for more information.
Upvotes: 1
Views: 165
Reputation: 529
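Your problem is answered on this page.
The tuples of empty strings appear because `RegexpTokenizer` hands the pattern to `re.findall`, which returns the contents of the groups instead of the whole match whenever the pattern contains capturing groups. You need to rewrite the regular expression as follows, with non-capturing groups `(?:...)` and the alternatives in this order, to solve your problem.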
`pattern = r'''(?x) # set flag to allow verbose regexps
(?:[A-Z]\.)+ # abbreviations, e.g. U.S.A.
| \$?\d+(?:\.\d+)?%?
| \w+(?:-\w+)* # words with optional internal hyphens
|/\m+(?:[-'/]\w+)*
'''`
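To see the effect in isolation, here is a minimal sketch that uses only the standard `re` module and leaves NLTK out. The sample sentence and the variable names are made up for illustration, and the patterns are trimmed-down versions of the ones above, so treat it as a demonstration of how `re.findall` handles groups rather than a drop-in replacement for `preprocess`.

```python
import re

# Hypothetical sample text, chosen to exercise every alternative.
text = "the U.S.A. spent $12.50 on a well-known idea"

# Same alternatives written twice: once with capturing groups,
# once with non-capturing (?:...) groups.
capturing = r'''(?x)
      ([A-Z]\.)+          # abbreviations, e.g. U.S.A.
    | \$?\d+(\.\d+)?%?    # currency and percentages
    | \w+(-\w+)*          # words with optional internal hyphens
'''
non_capturing = r'''(?x)
      (?:[A-Z]\.)+
    | \$?\d+(?:\.\d+)?%?
    | \w+(?:-\w+)*
'''

# With capturing groups, findall returns one tuple per match holding the
# group contents, which are mostly empty strings.
tuples = re.findall(capturing, text)

# With non-capturing groups, findall returns the full matched strings.
tokens = re.findall(non_capturing, text)

print(tuples[0])  # ('', '', '')
print(tokens)     # ['the', 'U.S.A.', 'spent', '$12.50', 'on', 'a', 'well-known', 'idea']
```

The first result has the same shape as the output reported on the other computers, while the second is the flat list of tokens that the rest of `preprocess` expects.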
Upvotes: 1