user3092781

Reputation: 313

Different results from the same Python code using the NLTK library on different computers

I wrote the following code, which works fine on my computer but returns only empty values on other computers. Could you please help me solve this problem?

import string
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords

def preprocess(sentence):
    sentence = sentence.lower()
    specialChrs={'\xc2',''} 
    pattern = r'''(?x)               # set flag to allow verbose regexps
              ([A-Z]\.)+         # abbreviations, e.g. U.S.A.
              | \$?\d+%?
              | \$?\d+(,|.\d+)*
              | \w+([-'/]\w+)*    # words w/ optional internal hyphens/apostrophe
              |/\m+([-'/]\w+)*
            '''
    tokenizer = RegexpTokenizer(pattern)
    tokens = tokenizer.tokenize(sentence)
    print tokens
    realToken= [e for e in tokens if  len(e)>= 3 and len(e)<10]
    stopWords = set(stopwords.words('english'))
    stop_words = [w for w in realToken if not w in stopWords]
    filtered_words = [w for w in stop_words if not w in specialChrs]
    print filtered_words
   # final_words = [w for w in filtered_words if not w[0]=='0' and w[1]=='x']
    return filtered_words


str='I have one generalized rule, where in shellscript I check for all need packages, if any package does not exist, then install it other wise skip to next check. As I need to check and execute few other python as well shellscripts, I am using it. Is using shellscript for this is bad idea?'
preprocess(str)

This is part of the output on my computer:

['i', 'have', 'one', 'generalized', 'rule', 'where', 'in', 'shellscript', 'i', 'check', 'for', 'all', 'need', .......'idea']

Results on other computers:

[('', '', '', ''), ('', '', '', ''), ('', '', '', ''), ('', '', '', ''), ('', '', '', ''), ('', '', '', ''), ('', '', '', ''),... ]

My computer's information:

Python 2.7.12 |Anaconda 2.3.0 (64-bit)| (default, Jul 2 2016, 17:42:40)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Anaconda is brought to you by Continuum Analytics.
Please check out: http://continuum.io/thanks and https://anaconda.org

import nltk

print('The nltk version is {}.'.format(nltk.__version__))

The nltk version is 3.2.1.

My friend's computer:

Python 2.7.12 |Anaconda 4.1.1 (64-bit)| (default, Jun 29 2016, 11:42:40)
[MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
Anaconda is brought to you by Continuum Analytics.
Please check out: http://continuum.io/thanks and https://anaconda.org

import nltk

print('The nltk version is {}.'.format(nltk.__version__))

The nltk version is 3.2.1.

I also tested my code on another computer and got the same result.

That computer's information is:

Python 2.7.3 (default, Oct 26 2016, 21:01:49)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.

Upvotes: 1

Views: 165

Answers (1)

user3487667

Reputation: 529

Your problem is answered on this page.

You need to change the regular expression as follows, replacing each capturing group (...) with a non-capturing group (?:...) and keeping the alternatives in this order:

`pattern = r'''(?x)          # set flag to allow verbose regexps
            (?:[A-Z]\.)+        # abbreviations, e.g. U.S.A.
         | \$?\d+(?:\.\d+)?%?
         | \w+(?:-\w+)*        # words with optional internal hyphens
         |/\m+(?:[-'/]\w+)*
      '''`
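To see why the grouping matters, here is a minimal sketch using only the standard `re` module. It assumes, as I understand NLTK 3.x, that `RegexpTokenizer` hands the pattern to `findall`. The two patterns and the sample text are simplified stand-ins made up for illustration, not the full patterns above.

```python
import re

# Simplified, hypothetical patterns (not the ones from the question or answer).
capturing = r"\$?\d+(\.\d+)?|\w+(-\w+)*"          # two capturing groups
non_capturing = r"\$?\d+(?:\.\d+)?|\w+(?:-\w+)*"  # same pattern, non-capturing

text = "costs $3.50 well-known"

# With capturing groups, findall returns one tuple per match holding the
# group contents (empty string for a group that did not take part).
captured = re.findall(capturing, text)

# With non-capturing groups, findall returns the full matched text.
tokens = re.findall(non_capturing, text)

print(captured)  # [('', ''), ('.50', ''), ('', '-known')]
print(tokens)    # ['costs', '$3.50', 'well-known']
```

The pattern in the question has four capturing groups, which matches the four-element tuples of empty strings shown in the "other computers" output.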

Upvotes: 1
