Nipun Alahakoon

Reputation: 2862

How do I tokenize a set of documents into a unigram + bigram bag-of-words using gensim?

I know that with scikit-learn I could use this piece of code:

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(min_df=2, ngram_range=(1, 2), norm='l2')
corpus = vectorizer.fit_transform(text)

But how could I do the same with gensim?

Upvotes: 3

Views: 1012

Answers (2)

Desi Pilla

Reputation: 574

Using nltk's everygrams function is a good way to do this.

from nltk import everygrams

text = 'I like playing baseball'
# everygrams expects a sequence of tokens, so split the string into words first
grams = ['_'.join(gram) for gram in everygrams(text.split(), 1, 2)]
grams

>> ['I', 'like', 'playing', 'baseball', 'I_like', 'like_playing', 'playing_baseball']

This will create all uni- and bigrams in the text.
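If you then want a gensim-style bag-of-words over those tokens, a minimal sketch is to feed the underscore-joined n-grams into a gensim Dictionary and doc2bow (the docs list and variable names below are purely illustrative, not from the answer):

from gensim.corpora import Dictionary
from gensim.models import TfidfModel
from nltk import everygrams

# toy corpus, just for illustration
docs = ['I like playing baseball', 'I like watching baseball']

# unigram + bigram tokens per document, joined with underscores as above
tokenized = [['_'.join(g) for g in everygrams(d.split(), 1, 2)] for d in docs]

dictionary = Dictionary(tokenized)                        # token -> integer id
bow_corpus = [dictionary.doc2bow(t) for t in tokenized]   # sparse (id, count) vectors
tfidf = TfidfModel(bow_corpus)                            # optional: mirrors TfidfVectorizer's weighting
tfidf_corpus = [tfidf[bow] for bow in bow_corpus]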

Upvotes: 0

Peter Krejzl

Reputation: 60

I think you could take a look at simple_preprocess from gensim.utils:

gensim.utils.simple_preprocess(doc, deacc=False, min_len=2, max_len=15)

Convert a document into a list of tokens. This lowercases, tokenizes, and optionally de-accents the input. The output is a list of final tokens (unicode strings) that won't be processed any further.
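A minimal sketch of what that gives you (the example sentence is just illustrative):

from gensim.utils import simple_preprocess

doc = 'I like playing Baseball!'
tokens = simple_preprocess(doc)  # lowercased, punctuation removed, tokens shorter than min_len=2 dropped
tokens

>> ['like', 'playing', 'baseball']

Note that this only gives unigrams. To add bigrams you would still need something like gensim.models.Phrases on top of the token lists, and Phrases only merges statistically frequent pairs, so it is not an exact equivalent of ngram_range=(1, 2).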

Upvotes: 1
