Reputation: 1499
I was trying to use a regex pattern example from the NLTK book inside scikit-learn's CountVectorizer. I've seen examples with simple regexes, but not with anything like this:
from sklearn.feature_extraction.text import CountVectorizer
pattern = r'''(?x)          # set flag to allow verbose regexps
    ([A-Z]\.)+              # abbreviations (e.g. U.S.A.)
  | \w+(-\w+)*              # words with optional internal hyphens
  | \$?\d+(\.\d+)?%?        # currency & percentages
  | \.\.\.                  # ellipses
'''
text = 'I love N.Y.C. 100% even with all of its traffic-ridden streets...'
vectorizer = CountVectorizer(stop_words='english', token_pattern=pattern)
analyze = vectorizer.build_analyzer()
analyze(text)
This produces:
[(u'', u'', u''),
(u'', u'', u''),
(u'', u'', u''),
(u'', u'', u''),
(u'', u'', u''),
(u'', u'', u''),
(u'', u'', u''),
(u'', u'', u''),
(u'', u'', u''),
(u'', u'', u''),
(u'', u'', u''),
(u'', u'-ridden', u''),
(u'', u'', u''),
(u'', u'', u'')]
With nltk, I get something entirely different:
nltk.regexp_tokenize(text, pattern)
['I', 'love', 'N.Y.C.', '100', 'even', 'with', 'all', 'of', 'its', 'traffic-ridden', 'streets', '...']
Is there a way to get the scikit-learn CountVectorizer to output the same thing? I was hoping to use some of the other handy features that come with the same function call.
Upvotes: 2
Views: 2074
Reputation: 363687
TL;DR
from functools import partial
from nltk.tokenize import regexp_tokenize
from sklearn.feature_extraction.text import CountVectorizer
CountVectorizer(analyzer=partial(regexp_tokenize, pattern=pattern))
is a vectorizer that uses the NLTK tokenizer.
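A quick smoke test, as a self-contained sketch. I've made the groups in the pattern non-capturing here, so that any findall-based tokenizer returns whole matches rather than group contents; vocabulary_ is the fitted token-to-index mapping:
from functools import partial
from nltk.tokenize import regexp_tokenize
from sklearn.feature_extraction.text import CountVectorizer

pattern = r'''(?x)          # verbose regexps
    (?:[A-Z]\.)+            # abbreviations (e.g. U.S.A.)
  | \w+(?:-\w+)*            # words with optional internal hyphens
  | \$?\d+(?:\.\d+)?%?      # currency & percentages
  | \.\.\.                  # ellipses
'''
text = 'I love N.Y.C. 100% even with all of its traffic-ridden streets...'

v = CountVectorizer(analyzer=partial(regexp_tokenize, pattern=pattern))
X = v.fit_transform([text])   # 1 x n_features sparse count matrix
print(sorted(v.vocabulary_))  # the NLTK-style tokens as feature names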
Now for the actual problem: apparently nltk.regexp_tokenize does something quite special with its pattern, whereas scikit-learn simply runs re.findall with the pattern you give it, and findall doesn't like this pattern:
In [33]: re.findall(pattern, text)
Out[33]:
[('', '', ''),
('', '', ''),
('C.', '', ''),
('', '', ''),
('', '', ''),
('', '', ''),
('', '', ''),
('', '', ''),
('', '', ''),
('', '-ridden', ''),
('', '', ''),
('', '', '')]
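This is standard re.findall behaviour, not something scikit-learn adds: when a pattern contains capturing groups, findall returns the group contents instead of the whole match, and for a repeated group only the last repetition, which is exactly why stray pieces like 'C.' and '-ridden' survive while everything else comes back empty. A minimal illustration:
import re
re.findall(r'(ab)+', 'ababab')    # ['ab']      -- contents of group 1, last repetition
re.findall(r'(?:ab)+', 'ababab')  # ['ababab']  -- non-capturing group: whole match
nltk.regexp_tokenize evidently arranges to return the whole match for each token, which is why it tolerates this pattern.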
You'll either have to rewrite this pattern to make it work in scikit-learn style (there's a sketch of that at the end of this answer), or plug the NLTK tokenizer into scikit-learn:
In [41]: from functools import partial
In [42]: v = CountVectorizer(analyzer=partial(regexp_tokenize, pattern=pattern))
In [43]: v.build_analyzer()(text)
Out[43]:
['I',
'love',
'N.Y.C.',
'100',
'even',
'with',
'all',
'of',
'its',
'traffic-ridden',
'streets',
'...']
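For completeness, here's a sketch of the other option, rewriting the pattern in scikit-learn style: make every group non-capturing with (?:...) so that findall yields whole matches. Two caveats that are my additions, not from the question: CountVectorizer lowercases text before tokenizing by default, which would stop the [A-Z] branch from ever matching, so pass lowercase=False; and the stop_words filter still runs on the resulting tokens.
sk_pattern = r'''(?x)       # the same pattern, with non-capturing groups
    (?:[A-Z]\.)+            # abbreviations (e.g. U.S.A.)
  | \w+(?:-\w+)*            # words with optional internal hyphens
  | \$?\d+(?:\.\d+)?%?      # currency & percentages
  | \.\.\.                  # ellipses
'''
v = CountVectorizer(stop_words='english', token_pattern=sk_pattern,
                    lowercase=False)
v.build_analyzer()(text)
# whole-match tokens, with English stop words such as 'of' and 'with' removed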
Upvotes: 3