Reputation: 1449
I do not want terms of length less than 3 or more than, say, 7. There's a straightforward way of doing this in R, but I am not sure how to do it in Python. I tried this, but it still doesn't work:
from sklearn.feature_extraction.text import CountVectorizer
regex1 = '/^[a-zA-Z]{3,7}$/'
vectorizer = CountVectorizer( analyzer='word',tokenizer= tokenize,stop_words = stopwords,token_pattern = regex1,min_df= 2, max_df = 0.9,max_features = 2000)
vectorizer1 = vectorizer.fit_transform(token_dict.values())
I tried other regexes too:
"^[a-zA-Z]{3,7}$"
r'^[a-zA-Z]{3,7}$'
Upvotes: 3
Views: 3343
Reputation: 11032
I think your regex pattern is wrong here. The surrounding /.../ delimiters are JavaScript syntax, not Python. It should be like
regex1 = r'^[a-zA-Z]{3,7}$'
Also, I am assuming that the regex should match the entire string, NOT some substring, so a string like aaaaabbb cc
should be discarded.
If that is not what you want (for example, each document contains several words and you want to keep the short ones), use the word boundary \b
instead of the start ^
and end $
anchors. So it should be
regex1 = r'\b[a-zA-Z]{3,7}\b'
Here is a working example
from sklearn.feature_extraction.text import CountVectorizer
regex1 = r'\b[a-zA-Z]{3,7}\b'
token_dict = {123: 'horses', 345: 'ab'}
vectorizer = CountVectorizer(token_pattern = regex1)
vectorizer1 = vectorizer.fit_transform(token_dict.values())
print(vectorizer.get_feature_names_out())  # use get_feature_names() on scikit-learn older than 1.0
Output
['horses']
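To see why the anchored pattern finds nothing on real documents: as far as I know, when no custom tokenizer is given, CountVectorizer compiles token_pattern and calls findall on each whole document, not on individual words. Here is a minimal sketch with the standard re module that only emulates that step (the sample sentence is made up):

```python
import re

# Made-up sample document, already lowercased as CountVectorizer does by default.
doc = "horses eat ab grass constitutional"

anchored = re.compile(r'^[a-zA-Z]{3,7}$')
bounded = re.compile(r'\b[a-zA-Z]{3,7}\b')

# ^ and $ must match the start and end of the whole document,
# so a multi-word document yields no tokens at all.
print(anchored.findall(doc))       # []

# \b matches at every word edge, so each 3-7 letter word is kept
# and shorter or longer words are dropped.
print(bounded.findall(doc))        # ['horses', 'eat', 'grass']

# The anchored pattern only matches when the document is a single short word.
print(anchored.findall("horses"))  # ['horses']
```

That is also why the one-word documents in the example above would behave the same with either pattern, while longer texts would not.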
Upvotes: 1
Reputation: 2103
According to the CountVectorizer documentation, the default token_pattern
selects tokens of 2 or more alphanumeric characters. If you want to change this, pass your own regex.
In your case, add token_pattern = "^[a-zA-Z]{3,7}$"
to the options of CountVectorizer.
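One caveat, given that the question also passes tokenizer=tokenize: to my knowledge CountVectorizer does not consult token_pattern when a custom tokenizer is supplied, so in that setup the length limit would have to live in the tokenizer itself. A rough sketch, where tokenize_3_to_7 is a hypothetical stand-in for the question's tokenize function (which is not shown):

```python
import re

# Hypothetical replacement for the question's `tokenize`; the original is not shown.
_WORD = re.compile(r'[a-zA-Z]+')

def tokenize_3_to_7(text):
    """Return alphabetic tokens whose length is between 3 and 7 inclusive."""
    return [tok for tok in _WORD.findall(text.lower()) if 3 <= len(tok) <= 7]

print(tokenize_3_to_7("Elon Musk is a genius in Parliament"))
# ['elon', 'musk', 'genius']
```

It could then be passed as CountVectorizer(tokenizer=tokenize_3_to_7) instead of relying on token_pattern.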
Edit
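The regex that should be used is [a-zA-Z]{3,7}
(note that, lacking \b word boundaries, it also matches inside longer words, so those get split into chunks rather than discarded). See the example below: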
doc1 = ["Elon Musk is genius", "Are you mad", "Constitutional Ammendments in Indian Parliament",
        "Constitutional Ammendments in Indian Assembly", "House of Cards", "Indian House"]
from sklearn.feature_extraction.text import CountVectorizer
regex1 = '[a-zA-Z]{3,7}'
vectorizer = CountVectorizer(analyzer='word', stop_words = 'english', token_pattern = regex1)
vectorizer1 = vectorizer.fit_transform(doc1)
vectorizer.vocabulary_
Results -
{u'ammendm': 0,
u'assembl': 1,
u'cards': 2,
u'constit': 3,
u'elon': 4,
u'ent': 5,
u'ents': 6,
u'genius': 7,
u'house': 8,
u'indian': 9,
u'mad': 10,
u'musk': 11,
u'parliam': 12,
u'utional': 13}
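Entries such as 'constit' and 'utional' in these results are pieces of one long word. If the goal is to drop words longer than 7 letters instead of chopping them up, adding \b on both sides should do it. A small sketch with the standard re module, which only approximates the tokenizing step (no stop-word removal), comparing the two patterns on one of the documents:

```python
import re

# One document from doc1, lowercased as CountVectorizer does by default.
doc = "constitutional ammendments in indian parliament"

unbounded = re.compile(r'[a-zA-Z]{3,7}')
bounded = re.compile(r'\b[a-zA-Z]{3,7}\b')

# Without \b the pattern matches greedily inside long words and chops them up.
print(unbounded.findall(doc))
# ['constit', 'utional', 'ammendm', 'ents', 'indian', 'parliam', 'ent']

# With \b only whole words of 3 to 7 letters survive.
print(bounded.findall(doc))
# ['indian']
```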
Upvotes: 2