Reputation: 2862
I know that with scikit-learn I could use:
vectorizer = TfidfVectorizer(min_df=2, ngram_range=(1, 2), norm='l2')
corpus = vectorizer.fit_transform(text)
But how could I do the same thing with gensim?
Upvotes: 3
Views: 1012
Reputation: 574
Using nltk's everygrams function is a good way to do this.
from nltk import everygrams
text = 'I like playing baseball'
# split into words first; passing the raw string would give character n-grams
grams = ['_'.join(g) for g in everygrams(text.split(), 1, 2)]
grams
>> ['I', 'like', 'playing', 'baseball', 'I_like', 'like_playing', 'playing_baseball']
This will create all uni- and bigrams in the text.
Upvotes: 0
Reputation: 60
I think you could take a look at simple_preprocess from gensim.utils:
gensim.utils.simple_preprocess(doc, deacc=False, min_len=2, max_len=15)
It converts a document into a list of tokens: it lowercases, tokenizes, and optionally de-accents. The output is a list of final tokens (unicode strings) that won't be processed any further.
Upvotes: 1