Reputation: 2618
I am building a multilabel text classification program and I am trying to use OneVsRestClassifier+XGBClassifier to classify the text. Initially I used Sklearn's Tf-Idf vectorization to vectorize the texts, which worked without error. Now I am using Gensim's Word2Vec to vectorize the texts. When I feed the vectorized data into the OneVsRestClassifier+XGBClassifier, however, I get the following error on the line where I split the test and training data:
TypeError: Singleton array array(<gensim.models.word2vec.Word2Vec object at 0x...>, dtype=object) cannot be considered a valid collection.
I have tried converting the vectorized data into a feature array (np.array), but that hasn't seemed to work. Below is my code:
import numpy as np
from gensim.models import Word2Vec
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split

x = np.array(Word2Vec(textList, size=120, window=6, min_count=5, workers=7, iter=15))

vectorizer2 = MultiLabelBinarizer()
vectorizer2.fit(tagList)
y = vectorizer2.transform(tagList)

# Split test data and convert test data to arrays
xTrain, xTest, yTrain, yTest = train_test_split(x, y, test_size=0.20)
The variables textList and tagList are lists of strings (textual descriptions I am trying to classify).
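For completeness, the downstream step that the split data is then fed into is along these lines (an assumed sketch; this classifier code is not shown above):

from sklearn.multiclass import OneVsRestClassifier
from xgboost import XGBClassifier

# assumed classifier step, simplified
classifier = OneVsRestClassifier(XGBClassifier())
classifier.fit(xTrain, yTrain)
predictions = classifier.predict(xTest)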
Upvotes: 3
Views: 14200
Reputation: 121
x here becomes a NumPy array wrapping the gensim.models.word2vec.Word2Vec object itself -- it is not the word2vec representations of textList that are returned.
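One quick way to confirm this (a diagnostic sketch, not part of the original code) is to inspect what np.array actually produced:

import numpy as np
from gensim.models import Word2Vec

x = np.array(Word2Vec(textList, size=120, window=6,
                      min_count=5, workers=7, iter=15))

print(x.shape)         # () -- a zero-dimensional "singleton" array
print(x.dtype)         # object
print(type(x.item()))  # <class 'gensim.models.word2vec.Word2Vec'>

A zero-dimensional object array is exactly the "Singleton array ... cannot be considered a valid collection" that train_test_split complains about.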
Presumably, what you want is the corresponding vector for each word in a document (for a single vector representing each document as a whole, Doc2Vec would be the better fit -- a minimal sketch is included after the code below). A document containing n words would then be represented by an n * 120 matrix, 120 being the size passed to Word2Vec.
Unoptimized code for illustrative purposes:
import numpy as np
from gensim.models import Word2Vec

model = x = Word2Vec(textList, size=120, window=6,
                     min_count=5, workers=7, iter=15)

documents = []
for document in textList:
    word_vectors = []
    for word in document.split(' '):   # or your logic for separating tokens
        if word in model.wv:           # words pruned by min_count have no vector
            word_vectors.append(model.wv[word])
    # flatten each document into a single 1-D array of length n * 120,
    # where n is the number of in-vocabulary words and 120 is `Word2Vec:size`
    documents.append(np.concatenate(word_vectors))
document_matrix = np.concatenate(documents)
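As noted above, if a single fixed-length vector per document is preferred, a Doc2Vec sketch along these lines could be used instead (parameter names mirror the question's gensim call; treat this as an assumed illustration rather than code from the original answer):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# one TaggedDocument per text; integer ids are sufficient as tags
tagged = [TaggedDocument(words=document.split(' '), tags=[i])
          for i, document in enumerate(textList)]

d2v = Doc2Vec(tagged, size=120, window=6, min_count=5, workers=7, iter=15)

# one 120-dimensional vector per document -> shape (len(textList), 120)
x = np.array([d2v.docvecs[i] for i in range(len(textList))])

This produces a fixed-width matrix, which sidesteps the per-document length differences of the word-vector approach and can be passed directly to train_test_split.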
Upvotes: 2