SVM - passing a string to the CountVectorizer in Python vectorizes each character?

Question

I have a working SVM and the CountVectorizer works fine when the input to the transform function is a list of strings. However, if I just pass one string to it, the vectorizer iterates through each character in the string and vectorizes each one, even though I set the analyzer parameter to word when constructing the CountVectorizer.

for x in range(0, 3):
    test = raw_input("Type a message to classify: ")
    v = vectorizer.transform(test).toarray()
    print(v)
    print(len(v))
    print(svm.predict(vectorizer.transform(test).toarray()))

I'm able to fix this issue by changing the second line in the above code to:

test = [raw_input("Type a message to classify: ")]

But it seems strange to build a one-item list just for this. Is there a better way to do it without constructing a list?

David · Accepted Answer

The vectorizer expects a list or array of documents, so when you pass in a single string, it treats each element of that string (i.e., each character) as a separate document.
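To see why this happens, here is a minimal plain-Python sketch with no scikit-learn involved. The message text is made up for illustration. Anything that simply iterates over its input sees a bare string as a sequence of characters, while a one-item list yields the whole message once:

```python
# Hypothetical example message, for illustration only.
message = "hello world"

# Iterating over a bare string yields one item per character.
as_string = [doc for doc in message]

# Iterating over a one-item list yields the whole message once.
as_list = [doc for doc in [message]]

print(len(as_string))  # 11 "documents", one per character
print(as_list)         # ['hello world'], a single document
```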

Try changing svm.predict(vectorizer.transform(test).toarray()) to svm.predict(vectorizer.transform([test]).toarray())

PS: The toarray() call is not going to scale well once you use a real-world corpus. SVMs in sklearn can operate on sparse matrices, so I'd drop that part altogether.
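Putting both suggestions together, a rough sketch might look like the following. The training corpus, labels, and test message are invented for illustration, and the code assumes Python 3 and a reasonably recent scikit-learn. In recent releases, passing a bare string to transform raises a ValueError instead of splitting it into characters, so wrapping the message in a list is needed either way.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

# Hypothetical toy corpus and labels (1 = spam, 0 = not spam), for illustration only.
train_docs = [
    "free prize money now",
    "win free money",
    "meeting at noon tomorrow",
    "lunch with the team at noon",
]
train_labels = [1, 1, 0, 0]

vectorizer = CountVectorizer(analyzer="word")
# fit_transform returns a sparse matrix; train on it directly.
X_train = vectorizer.fit_transform(train_docs)

svm = SVC(kernel="linear")
svm.fit(X_train, train_labels)

message = "free money"  # stands in for the user's typed input

# Wrap the single message in a list so it counts as one document,
# and keep the result sparse instead of calling toarray().
X_one = vectorizer.transform([message])
prediction = svm.predict(X_one)

print(X_one.shape)  # one row, one column per vocabulary word
print(prediction)
```

Note that the model here is fitted on the sparse matrix as well; a model fitted on dense arrays may not accept sparse input at prediction time.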
