Reputation: 796
I'm trying to write a function that takes in one document and count-vectorizes its bigrams. The result shouldn't contain any zeroes, since I'm only processing one document at a time. Then I want to take the average of those counts to get a sense of bigram repetition.
Any problems with this code?
def avg_bigram(x):
    bigram_vectorizer = CountVectorizer(stop_words='english', ngram_range=(2,2))
    model = bigram_vectorizer.fit_transform(x)
    vector = model.toarray()
    return vector.mean()
I've tested it with text that I know contains more than stop words, and I get back
"empty vocabulary; perhaps the documents only contain stop words"
Thank you for any help!
Upvotes: 0
Views: 723
Reputation: 96360
CountVectorizer expects a corpus (an iterable of documents), while you are giving it a single document. Just wrap your document in a list, e.g.:
model = bigram_vectorizer.fit_transform([x])
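For completeness, here is a minimal sketch of the whole function with that one-line change applied. The sample sentence and the parameter name `doc` are only illustrative, and I'm assuming scikit-learn is installed:

```python
from sklearn.feature_extraction.text import CountVectorizer


def avg_bigram(doc):
    """Return the mean count per distinct bigram in a single document."""
    bigram_vectorizer = CountVectorizer(stop_words='english', ngram_range=(2, 2))
    # Wrap the single document in a list so it is treated as a one-document corpus.
    model = bigram_vectorizer.fit_transform([doc])
    # One row (the document), one column per distinct bigram.
    vector = model.toarray()
    return vector.mean()


# Illustrative input: "red apple" occurs twice, the other three bigrams once each.
sample = "red apple red apple green pear"
print(avg_bigram(sample))
```

Keep in mind that stop words are dropped before the bigrams are built, so a bigram can span a removed stop word, and a document with fewer than two non-stop-word tokens will still raise the "empty vocabulary" error.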
Upvotes: 1