Reputation: 11
I am working on the 20 Newsgroups dataset in Python. After running CountVectorizer on it, I tried to fit gensim's TfIdfTransformer (from its sklearn API) to get augmented term frequency, but I am getting an error.
Here is my code:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer(max_features=2000)
from sklearn.datasets import fetch_20newsgroups
twenty_train = fetch_20newsgroups(subset='train', shuffle=True)
X_train_counts = count_vect.fit_transform(twenty_train.data)
from gensim.sklearn_api import TfIdfTransformer
model = TfIdfTransformer(smartirs='atn')
tfidf_aug = model.fit_transform(X_train_counts)
After running the above code I get this error:
TypeError: sparse matrix length is ambiguous; use getnnz() or shape[0]
After adding getnnz() at the end like this:
tfidf_aug = model.fit_transform(X_train_counts.getnnz())
I get this error:
TypeError: 'int' object is not iterable
Upvotes: 1
Views: 263
Reputation: 16966
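The input to gensim's TfIdfTransformer has to be an iterable of documents, where each document is an iterable of (int, int) tuples of the form (term_id, count), as mentioned in the gensim documentation. A scipy sparse matrix does not have that shape (and getnnz() just returns a single int), so you have to convert the sparse matrix before passing it to the gensim model.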
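To make the expected shape concrete, here is a small plain-Python sketch (no gensim or scipy needed). The helper name `looks_like_bow_corpus` is made up for illustration and is not part of gensim; it only checks that every document is a sequence of (term_id, weight) pairs.

```python
from numbers import Integral, Real

def looks_like_bow_corpus(corpus):
    """Return True if corpus is an iterable of documents made of (int, number) tuples."""
    try:
        docs = list(corpus)
    except TypeError:
        # A bare int, such as the result of getnnz(), is not iterable.
        return False
    for doc in docs:
        try:
            pairs = list(doc)
        except TypeError:
            return False
        for pair in pairs:
            if not (isinstance(pair, tuple) and len(pair) == 2):
                return False
            term_id, weight = pair
            if not isinstance(term_id, Integral) or not isinstance(weight, Real):
                return False
    return True

# Three toy documents; the last one is empty.
good_corpus = [[(0, 2), (3, 1)], [(1, 1)], []]
print(looks_like_bow_corpus(good_corpus))
print(looks_like_bow_corpus(12345))
```

The first call passes because each document is a list of pairs; the second fails for the same reason the 'int' object is not iterable error appears.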
Try this
from sklearn.datasets import fetch_20newsgroups
twenty_train = fetch_20newsgroups(subset='train', shuffle=True)
from gensim.sklearn_api import TfIdfTransformer
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer(max_features=2000)
X_train_counts = count_vect.fit_transform(twenty_train.data)
model = TfIdfTransformer(smartirs='atn')
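# gensim expects (term_id, count) pairs: column indices first, counts second
bow_corpus = [[(int(i), int(c)) for i, c in zip(row.indices, row.data)] for row in X_train_counts]
tfidf_aug = model.fit_transform(bow_corpus)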
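If it helps to see what the row-by-row conversion is doing, here is a minimal sketch that works on raw CSR arrays in plain Python. The function name `csr_to_bow` and the toy numbers are my own, not from scikit-learn or gensim. It follows gensim's bag-of-words convention of term id first, count second.

```python
def csr_to_bow(data, indices, indptr):
    """Turn raw CSR arrays into a list of documents of (term_id, count) pairs."""
    corpus = []
    for row in range(len(indptr) - 1):
        # indptr[row]:indptr[row + 1] is the slice of data/indices for this row
        start, end = indptr[row], indptr[row + 1]
        doc = [(int(t), int(c)) for t, c in zip(indices[start:end], data[start:end])]
        corpus.append(doc)
    return corpus

# Toy 3 x 4 count matrix:
#   doc 0: term 0 twice, term 3 once
#   doc 1: term 1 once
#   doc 2: empty
data = [2, 1, 1]
indices = [0, 3, 1]
indptr = [0, 2, 3, 3]

bow = csr_to_bow(data, indices, indptr)
print(bow)  # [[(0, 2), (3, 1)], [(1, 1)], []]
```

On a real scipy CSR matrix the same three arrays are available as `X.data`, `X.indices` and `X.indptr`. As far as I know, gensim also has `gensim.matutils.Sparse2Corpus`, which with `documents_columns=False` should do this conversion for you, but check it against your gensim version.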
Upvotes: 1