Nigel Yong

Reputation: 68

How to reproduce the default sklearn CountVectorizer tokenization with only regex?

I don't want to use CountVectorizer, but I'd like to reproduce its tokenization. I know it removes some special characters and lowercases the text. I tried splitting on the regex r'[\W_]+' and joining with ' ' as the delimiter, but I still can't reproduce it. Any ideas?

Upvotes: 1

Views: 527

Answers (1)

Sergey Bushmanov

Reputation: 25199

Use the regex '(?u)\\b\\w\\w+\\b' instead.

To replicate this CountVectorizer result:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()

text = "This a simple example. And this is another."
text_transformed = cv.fit_transform([text])
vocab = sorted(cv.vocabulary_)
counts = text_transformed.toarray()
print(pd.DataFrame(counts, columns = vocab))

   and  another  example  is  simple  this
0    1        1        1   1       1     2

you could use re and collections.Counter:

import re
from collections import Counter

regex = re.compile('(?u)\\b\\w\\w+\\b')
tokens = re.findall(regex, text.lower())  # lowercase first, mirroring lowercase=True
vocab = sorted(set(tokens))
counts = Counter(tokens)
counts = [counts[key] for key in sorted(counts.keys())]
# vocab, counts = zip(*sorted(Counter(tokens).items()))  # one-liner with asterisk unpacking
print(pd.DataFrame([counts], columns = vocab))

   and  another  example  is  simple  this
0    1        1        1   1       1     2
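If you have more than one document, the same idea extends to a small count matrix. This is only a sketch using the standard library; count_matrix and TOKEN_PATTERN are names I made up for illustration, and it assumes the default CountVectorizer settings (lowercasing, unigrams, no stop words):

```python
import re
from collections import Counter

# Same pattern as CountVectorizer's default token_pattern.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def count_matrix(docs):
    """Return (sorted vocabulary, list of per-document count rows)."""
    # One Counter per document, built from lowercased text.
    per_doc = [Counter(TOKEN_PATTERN.findall(doc.lower())) for doc in docs]
    # Vocabulary is the sorted union of all tokens seen in any document.
    vocab = sorted(set().union(*per_doc))
    # A Counter returns 0 for missing keys, so absent terms need no special case.
    rows = [[counts[term] for term in vocab] for counts in per_doc]
    return vocab, rows

vocab, rows = count_matrix(["This a simple example.", "And this is another."])
print(vocab)  # ['and', 'another', 'example', 'is', 'simple', 'this']
print(rows)   # [[0, 0, 1, 0, 1, 1], [1, 1, 0, 1, 0, 1]]
```

The rows come back as plain lists rather than a sparse matrix, which is fine for small inputs; pd.DataFrame(rows, columns=vocab) gives the same table layout as above.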

Explanation:

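By default, CountVectorizer uses token_pattern='(?u)\\b\\w\\w+\\b' as the regex to extract tokens from text. It matches runs of two or more word characters, so single-character tokens are dropped and punctuation only acts as a separator: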

print(cv)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

By supplying this regex to re.findall on the lowercased text you get the same tokens as CountVectorizer's default settings, and by counting them you reproduce its output.
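As for why splitting on r'[\W_]+' doesn't match: a quick side-by-side, as far as I can tell, shows two differences. The split keeps single-character tokens and breaks words on underscores, while the default pattern does neither. The function names and the sample string below are my own, purely for illustration:

```python
import re

# Default token_pattern of CountVectorizer: two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def findall_tokenize(text):
    """Lowercase, then extract tokens with the default token_pattern."""
    return TOKEN_PATTERN.findall(text.lower())

def split_tokenize(text):
    """The approach from the question: split on non-word characters and underscores."""
    return [tok for tok in re.split(r"[\W_]+", text.lower()) if tok]

sample = "This a simple_example. I said: it's 4 U!"
print(findall_tokenize(sample))
# ['this', 'simple_example', 'said', 'it']
print(split_tokenize(sample))
# ['this', 'a', 'simple', 'example', 'i', 'said', 'it', 's', '4', 'u']
```

Filtering the split result down to tokens of length two or more would get closer, but the underscore handling would still differ, so re.findall with the original pattern is the simpler route.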

Upvotes: 2
