anonygrits

Reputation: 1499

Using an nltk regex example in scikit-learn's CountVectorizer

I was trying to use an example from the NLTK book as a regex pattern inside CountVectorizer from scikit-learn. I've seen examples with simple regexes, but not with something like this:

pattern = r'''(?x)          # set flag to allow verbose regexps
    ([A-Z]\.)+          # abbreviations (e.g. U.S.A.)
    | \w+(-\w+)*        # words with optional internal hyphens
    | \$?\d+(\.\d+)?%?  # currency & percentages
    | \.\.\.            # ellipses '''

from sklearn.feature_extraction.text import CountVectorizer

text = 'I love N.Y.C. 100% even with all of its traffic-ridden streets...'
vectorizer = CountVectorizer(stop_words='english', token_pattern=pattern)
analyze = vectorizer.build_analyzer()
analyze(text)

This produces:

[(u'', u'', u''),
 (u'', u'', u''),
 (u'', u'', u''),
 (u'', u'', u''),
 (u'', u'', u''),
 (u'', u'', u''),
 (u'', u'', u''),
 (u'', u'', u''),
 (u'', u'', u''),
 (u'', u'', u''),
 (u'', u'', u''),
 (u'', u'-ridden', u''),
 (u'', u'', u''),
 (u'', u'', u'')]

With nltk, I get something entirely different:

import nltk
nltk.regexp_tokenize(text, pattern)

['I', 'love', 'N.Y.C.', '100', 'even', 'with', 'all', 'of', 'its', 'traffic-ridden', 'streets', '...']

Is there a way to get the scikit-learn CountVectorizer to output the same thing? I was hoping to use some of the other handy features that are built into the same class.

Upvotes: 2

Views: 2074

Answers (1)

Fred Foo

Reputation: 363687

TL;DR

from functools import partial
from nltk.tokenize import regexp_tokenize
from sklearn.feature_extraction.text import CountVectorizer

CountVectorizer(analyzer=partial(regexp_tokenize, pattern=pattern))

is a vectorizer that uses the NLTK tokenizer.
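Here partial(regexp_tokenize, pattern=pattern) just binds the pattern argument, turning regexp_tokenize into the one-argument callable (document in, token list out) that a callable analyzer must be. An equivalent spelling, if you prefer a lambda:

CountVectorizer(analyzer=lambda doc: regexp_tokenize(doc, pattern=pattern))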

Now for the actual problem: nltk.regexp_tokenize handles the capturing groups in its pattern specially, whereas scikit-learn simply does an re.findall with the pattern you give it, and whenever a pattern contains capturing groups, findall returns the groups' captures instead of the whole match:

In [33]: re.findall(pattern, text)
Out[33]: 
[('', '', ''),
 ('', '', ''),
 ('C.', '', ''),
 ('', '', ''),
 ('', '', ''),
 ('', '', ''),
 ('', '', ''),
 ('', '', ''),
 ('', '', ''),
 ('', '-ridden', ''),
 ('', '', ''),
 ('', '', '')]
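Those tuples are the captures of the three groups ([A-Z]\.), (-\w+) and (\.\d+). One fix, sketched here rather than a definitive rewrite, is to make every group non-capturing with (?:...) so findall returns whole matches again. Two assumptions in the sketch: stop_words is left out so the output matches the NLTK one, and lowercase=False is needed because CountVectorizer lowercases by default, which would keep [A-Z] from ever matching:

from sklearn.feature_extraction.text import CountVectorizer

# The question's pattern with every capturing group made non-capturing,
# so re.findall returns whole matches instead of per-group tuples.
pattern_nc = r'''(?x)           # verbose regexp
      (?:[A-Z]\.)+              # abbreviations, e.g. U.S.A.
    | \w+(?:-\w+)*              # words with optional internal hyphens
    | \$?\d+(?:\.\d+)?%?        # currency and percentages
    | \.\.\.                    # ellipsis
'''

text = 'I love N.Y.C. 100% even with all of its traffic-ridden streets...'

v = CountVectorizer(token_pattern=pattern_nc, lowercase=False)
v.build_analyzer()(text)
# ['I', 'love', 'N.Y.C.', '100', 'even', 'with', 'all', 'of',
#  'its', 'traffic-ridden', 'streets', '...']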

You'll either have to rewrite the pattern along the lines sketched above to make it work scikit-learn style, or plug the NLTK tokenizer into scikit-learn:

In [41]: from functools import partial

In [42]: v = CountVectorizer(analyzer=partial(regexp_tokenize, pattern=pattern))

In [43]: v.build_analyzer()(text)
Out[43]: 
['I',
 'love',
 'N.Y.C.',
 '100',
 'even',
 'with',
 'all',
 'of',
 'its',
 'traffic-ridden',
 'streets',
 '...']
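One caveat if you take the callable-analyzer route: CountVectorizer then hands each raw document straight to your callable, so options such as lowercase and stop_words are not applied, and you'd have to handle them inside the tokenizer yourself. Fitting still works as usual. A minimal sketch, with a made-up two-document corpus:

from functools import partial
from nltk.tokenize import regexp_tokenize
from sklearn.feature_extraction.text import CountVectorizer

pattern = r'''(?x)              # the pattern from the question
      ([A-Z]\.)+                # abbreviations, e.g. U.S.A.
    | \w+(-\w+)*                # words with optional internal hyphens
    | \$?\d+(\.\d+)?%?          # currency and percentages
    | \.\.\.                    # ellipsis
'''

# Two invented documents, purely for illustration.
docs = ['I love N.Y.C. 100% even with all of its traffic-ridden streets...',
        'The U.S.A. has many traffic-ridden streets too.']

v = CountVectorizer(analyzer=partial(regexp_tokenize, pattern=pattern))
X = v.fit_transform(docs)       # sparse document-term matrix
sorted(v.vocabulary_)           # the tokens regexp_tokenize produced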

Upvotes: 3
