WhitneyChia

Reputation: 796

Count vectorizing into bigrams for one document, and then taking the average

I'm trying to write a function that takes in one document and count-vectorizes its bigrams. The resulting vector shouldn't contain any zeroes, as I'm only doing this to one document at a time. I then want to take the average of those counts to get a sense of bigram repetition.

Any problems with this code?

from sklearn.feature_extraction.text import CountVectorizer

def avg_bigram(x):
    # x is a single document (one string)
    bigram_vectorizer = CountVectorizer(stop_words='english', ngram_range=(2, 2))
    model = bigram_vectorizer.fit_transform(x)
    vector = model.toarray()
    return vector.mean()

I've tested it with text that I know contains more than stop words, and I get back

"empty vocabulary; perhaps the documents only contain stop words"

Thank you for any help!

Upvotes: 0

Views: 723

Answers (1)

juanpa.arrivillaga

Reputation: 96360

CountVectorizer expects a corpus (an iterable of documents), but you are passing it a single document. Just wrap your document in a list, e.g.:

model = bigram_vectorizer.fit_transform([x])
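For completeness, here is a minimal sketch of the whole function with that one change applied. The parameter name `doc` and the sample string are my own for illustration and are not from the question; the rest follows the original code.

```python
from sklearn.feature_extraction.text import CountVectorizer


def avg_bigram(doc):
    """Return the mean count per distinct bigram in a single document."""
    bigram_vectorizer = CountVectorizer(stop_words='english', ngram_range=(2, 2))
    # Wrap the single document in a list so it is treated as a one-document corpus.
    counts = bigram_vectorizer.fit_transform([doc])
    # The result has one row and one column per distinct bigram,
    # so every entry is at least 1 and the mean reflects repetition.
    return counts.toarray().mean()


# Hypothetical sample text: "apple banana" occurs twice, "banana apple" once.
print(avg_bigram("apple banana apple banana"))  # 1.5
```

A value of exactly 1.0 would mean no bigram repeats; anything above that means at least one bigram occurs more than once. Note that the stop words are removed before the bigrams are built, so words on either side of a dropped stop word end up paired together.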

Upvotes: 1
