Reputation: 469
I want to generate n-grams, but letter by letter instead of word by word.
Normal (word-level) n-grams:
Sentence: He want to watch football match
Result:
he, he want, want, want to, to, to watch, watch, watch football, football, football match, match
I want to do the same thing, but letter by letter:
Word: Angela
Result:
a, an, n, ng, g, ge, e, el, l, la, a
This is my code using scikit-learn, but it still works word by word rather than letter by letter:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(ngram_range=(1, 100),token_pattern = r"(?u)\b\w+\b")
corpus = ['Angel','Angelica','John','Johnson']
X = vectorizer.fit_transform(corpus)
analyze = vectorizer.build_analyzer()
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out())
print(vectorizer.transform(['Angela']).toarray())
Upvotes: 3
Views: 3136
Reputation: 36617
CountVectorizer has an 'analyzer' parameter that does what you want.
According to the documentation:
analyzer : string, {‘word’, ‘char’, ‘char_wb’} or callable
Whether the feature should be made of word or character n-grams. Option ‘char_wb’ creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space.
If a callable is passed it is used to extract the sequence of features out of the raw, unprocessed input.
By default it is set to 'word'; set it to 'char' to get character n-grams.
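To make the difference between 'char' and 'char_wb' concrete, here is a minimal pure-Python sketch of the two ideas described in the documentation excerpt. The helper names `char_ngrams` and `char_wb_ngrams` are made up for illustration, and this is a simplification rather than scikit-learn's actual implementation (edge cases such as very short words may be handled differently there).

```python
def char_ngrams(text, min_n=1, max_n=2):
    """Every substring of length min_n..max_n, spaces included."""
    text = text.lower()
    return [text[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(text) - n + 1)]


def char_wb_ngrams(text, min_n=1, max_n=2):
    """Same idea, but each word is padded with one space on each side
    and processed on its own, so no n-gram spans two words."""
    grams = []
    for word in text.lower().split():
        grams.extend(char_ngrams(f" {word} ", min_n, max_n))
    return grams


# Unigrams and bigrams of a single word
print(char_ngrams("Angela"))

# Bigrams only: 'char' crosses the word boundary, 'char_wb' does not
print(char_ngrams("to be", 2, 2))
print(char_wb_ngrams("to be", 2, 2))
```

With 'char', the bigram "o " followed by " b" comes from the raw text including the space between words; with 'char_wb', each word contributes its own padded n-grams.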
Just do:
# token_pattern is only used when analyzer='word', so it is dropped here
vectorizer = CountVectorizer(ngram_range=(1, 100),
                             analyzer='char')
Upvotes: 7