Reputation: 8798
I'm trying to convert a string into a numeric vector:
import re

### Clean the string
def names_to_words(names):
    print('a')
    # Replace every non-letter with a space, lowercase, and split on whitespace
    words = re.sub("[^a-zA-Z]", " ", names).lower().split()
    print('b')
    return words
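from sklearn.feature_extraction.text import CountVectorizer

### Vectorization
def Vectorizer():
    # Build a word-level bag-of-words vectorizer with default tokenization
    Vectorizer = CountVectorizer(
        analyzer = "word",
        tokenizer = None,
        preprocessor = None,
        stop_words = None,
        max_features = 5000)
    return Vectorizer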
### Test a string
s = 'abc...'
r = names_to_words(s)
feature = Vectorizer().fit_transform(r).toarray()
But when the cleaned word list is:
['g', 'o', 'm', 'd']
I get this error:
ValueError: empty vocabulary; perhaps the documents only contain stop words
It seems the problem is with single-letter strings like these. What should I do? Thanks.
Upvotes: 10
Views: 4627
Reputation: 36599
The default token_pattern regexp in CountVectorizer only selects words that have at least 2 characters, as stated in the documentation:
token_pattern : string
Regular expression denoting what constitutes a “token”, only used if analyzer == 'word'. The default regexp select tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).
From the source code of CountVectorizer, the default is r"(?u)\b\w\w+\b"
Change it to r"(?u)\b\w+\b"
to include single-letter words.
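To see the difference between the two patterns without involving scikit-learn at all, here is a minimal sketch using only the standard `re` module. It assumes the word analyzer tokenizes each document by calling `findall` with `token_pattern`, which is how I read the source; the `tokenize` helper and variable names are made up for illustration.

```python
import re

# Default CountVectorizer pattern: tokens of 2 or more word characters
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Relaxed pattern: tokens of 1 or more word characters
SINGLE_CHAR_PATTERN = r"(?u)\b\w+\b"


def tokenize(pattern, text):
    """Hypothetical helper: return every match of `pattern` in `text`."""
    return re.compile(pattern).findall(text)


# Each list element is treated as one document, as in the question
docs = ['g', 'o', 'm', 'd']

default_tokens = [tokenize(DEFAULT_PATTERN, d) for d in docs]
relaxed_tokens = [tokenize(SINGLE_CHAR_PATTERN, d) for d in docs]

print(default_tokens)  # [[], [], [], []]
print(relaxed_tokens)  # [['g'], ['o'], ['m'], ['d']]
```

With the default pattern every document yields no tokens at all, which is why the vocabulary ends up empty.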
Change your code to the following (adding the token_pattern parameter with the pattern suggested above):
Vectorizer = CountVectorizer(
    analyzer = "word",
    tokenizer = None,
    preprocessor = None,
    stop_words = None,
    max_features = 5000,
    token_pattern = r"(?u)\b\w+\b")
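Putting it together, a rough end-to-end sketch might look like the following. The input string `'g o m d'` is my own stand-in (the string in the question is elided), and `make_vectorizer` is just a renamed version of the question's factory function, so treat this as an illustration rather than the exact original code.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer


def names_to_words(names):
    # Same cleaning as in the question: letters only, lowercased, split on spaces
    return re.sub("[^a-zA-Z]", " ", names).lower().split()


def make_vectorizer():
    # token_pattern relaxed so that single-character tokens are kept
    return CountVectorizer(
        analyzer="word",
        stop_words=None,
        max_features=5000,
        token_pattern=r"(?u)\b\w+\b",
    )


# Assumed input that reduces to ['g', 'o', 'm', 'd'] after cleaning
words = names_to_words("g o m d")

vectorizer = make_vectorizer()
# Each word is treated as its own document, as in the question
features = vectorizer.fit_transform(words).toarray()

print(sorted(vectorizer.vocabulary_))  # ['d', 'g', 'm', 'o']
print(features.shape)                  # (4, 4)
```

Note that passing a list of words means each word becomes a separate document, so you get one row per word. If you want a single vector for the whole string, pass a one-element list containing the joined string instead.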
Upvotes: 14