Reputation: 837
I am extracting features from a text corpus using a tf-idf vectorizer and truncated singular value decomposition from scikit-learn. However, the algorithm I want to try requires dense matrices and the vectorizer returns sparse matrices, so I need to convert those matrices to dense arrays. But whenever I try to convert them, I get an error telling me that my numpy array object has no attribute "toarray". What am I doing wrong?
The function:
def feature_extraction(train,train_test,test_set):
    vectorizer = TfidfVectorizer(min_df = 3,strip_accents = "unicode",analyzer = "word",token_pattern = r'\w{1,}',ngram_range = (1,2))
    print("fitting Vectorizer")
    vectorizer.fit(train)
    print("transforming text")
    train = vectorizer.transform(train)
    train_test = vectorizer.transform(train_test)
    test_set = vectorizer.transform(test_set)
    print("Dimensionality reduction")
    svd = TruncatedSVD(n_components = 100)
    svd.fit(train)
    train = svd.transform(train)
    train_test = svd.transform(train_test)
    test_set = svd.transform(test_set)
    print("convert to dense array")
    train = train.toarray()
    test_set = test_set.toarray()
    train_test = train_test.toarray()
    print(train.shape)
    return train,train_test,test_set
Traceback:
Traceback (most recent call last):
  File "C:\Users\Anonymous\workspace\final_submission\src\linearSVM.py", line 24, in <module>
    x_train,x_test,test_set = feature_extraction(x_train,x_test,test_set)
  File "C:\Users\Anonymous\workspace\final_submission\src\Preprocessing.py", line 57, in feature_extraction
    train = train.toarray()
AttributeError: 'numpy.ndarray' object has no attribute 'toarray'
Update: Willy pointed out that my assumption that the matrix is sparse might be wrong. So I tried feeding my data to the algorithm after dimensionality reduction, and it worked without any conversion. However, when I skipped dimensionality reduction, which leaves around 53k features, I got the following error:
Traceback (most recent call last):
  File "C:\Users\Anonymous\workspace\final_submission\src\linearSVM.py", line 28, in <module>
    result = bayesian_ridge(x_train,x_test,y_train,y_test,test_set)
  File "C:\Users\Anonymous\workspace\final_submission\src\Algorithms.py", line 84, in bayesian_ridge
    algo = algo.fit(x_train,y_train[:,i])
  File "C:\Python27\lib\site-packages\sklearn\linear_model\bayes.py", line 136, in fit
    dtype=np.float)
  File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 220, in check_arrays
    raise TypeError('A sparse matrix was passed, but dense '
TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.
Can someone explain this?
Update2
As requested, I'll give all the code involved. Since it is scattered over different files I'll just post it in steps. For clarity I'll leave all the module imports out.
This is how I preprocess my data:
def regexp(data):
    for row in range(len(data)):
        data[row] = re.sub(r'[\W_]+'," ",data[row])
    return data

def clean_the_text(data):
    alist = []
    data = nltk.word_tokenize(data)
    for j in data:
        j = j.lower()
        alist.append(j.rstrip('\n'))
    alist = " ".join(alist)
    return alist

def loop_data(data):
    for i in range(len(data)):
        data[i] = clean_the_text(data[i])
    return data

if __name__ == "__main__":
    print("loading train")
    train_text = porter_stemmer(loop_data(regexp(list(np.array(p.read_csv(os.path.join(dir,"train.csv")))[:,1]))))
    print("loading test_set")
    test_set = porter_stemmer(loop_data(regexp(list(np.array(p.read_csv(os.path.join(dir,"test.csv")))[:,1]))))
After splitting my train_set into an x_train and an x_test for cross-validation, I transform my data using the feature_extraction function above:
x_train,x_test,test_set = feature_extraction(x_train,x_test,test_set)
Finally, I feed them into my algorithm:
def bayesian_ridge(x_train,x_test,y_train,y_test,test_set):
    algo = linear_model.BayesianRidge()
    algo = algo.fit(x_train,y_train)
    pred = algo.predict(x_test)
    error = pred - y_test
    result.append(algo.predict(test_set))
    print("Bayes_error: ",cross_val(error))
    return result
Upvotes: 5
Views: 41131
Reputation: 363487
TruncatedSVD.transform returns an array, not a sparse matrix. In fact, in the present version of scikit-learn, only the vectorizers return sparse matrices.
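To make the difference concrete, here is a minimal sketch of one way to handle both cases from the question. The toy corpus and the to_dense helper are hypothetical and not part of the asker's code; the idea is simply to call toarray() only when the object really is a SciPy sparse matrix.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (an assumption for illustration, not the asker's data).
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
    "the mat and the log",
]

# The vectorizer output is a SciPy sparse matrix.
tfidf = TfidfVectorizer().fit_transform(docs)

# TruncatedSVD output is already a dense numpy.ndarray.
reduced = TruncatedSVD(n_components=2).fit_transform(tfidf)


def to_dense(X):
    """Hypothetical helper: densify only if X is a SciPy sparse matrix."""
    return X.toarray() if sparse.issparse(X) else np.asarray(X)


dense_tfidf = to_dense(tfidf)      # converted from sparse
dense_reduced = to_dense(reduced)  # returned as a dense array, no toarray() call

print(type(tfidf), type(reduced))
```

With a helper like this, the same pipeline works whether or not the SVD step is included. Keep in mind that densifying the raw tf-idf matrix with around 53k features can take a lot of memory, since every zero entry is stored explicitly.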
Upvotes: 2