Samden Lepcha

Reputation: 11

Augmented term frequency on the 20newsgroup dataset: TypeError: 'int' object is not iterable

I am working on the 20newsgroup dataset using Python. I ran CountVectorizer on it and then tried to fit the gensim API's TfIdfTransformer on the result to get augmented term frequency, but I am getting an error.

Here is my code:

from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer(max_features=2000)
from sklearn.datasets import fetch_20newsgroups
twenty_train = fetch_20newsgroups(subset='train', shuffle=True)
X_train_counts = count_vect.fit_transform(twenty_train.data)
from gensim.sklearn_api import TfIdfTransformer
model = TfIdfTransformer(smartirs='atn')
tfidf_aug = model.fit_transform(X_train_counts)

After running the above code I get this error:

TypeError: sparse matrix length is ambiguous; use getnnz() or shape[0]

I then tried adding getnnz() at the end, like this:

tfidf_aug = model.fit_transform(X_train_counts.getnnz())

I get this error:

TypeError: 'int' object is not iterable

Upvotes: 1

Views: 263

Answers (1)

Venkatachalam

Reputation: 16966

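The input to gensim's TfIdfTransformer has to be an iterable of documents, where each document is a list of (int, int) tuples, i.e. (term_id, count) pairs, as mentioned in the gensim documentation. Hence you have to convert the sparse matrix to that format before passing it to the gensim model. Calling getnnz() does not help, because it returns a single integer (the number of stored values), which cannot be iterated over.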
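To make the expected format concrete, here is a minimal pure-Python sketch. The tiny count matrix and the helper name `dense_to_bow` are made up for illustration and are not part of gensim or scikit-learn.

```python
# Hypothetical toy data: 2 documents, 3 vocabulary terms.
dense_counts = [
    [2, 0, 1],  # doc 0: term 0 twice, term 2 once
    [0, 3, 0],  # doc 1: term 1 three times
]

def dense_to_bow(rows):
    """Turn rows of counts into lists of (term_id, count) pairs, skipping zeros."""
    return [[(term_id, count) for term_id, count in enumerate(row) if count]
            for row in rows]

corpus = dense_to_bow(dense_counts)

# A bare integer (such as the value returned by getnnz()) is not iterable,
# which is where the second error message comes from.
try:
    iter(5)
except TypeError as exc:
    int_error = str(exc)
```

Each inner list is one document, and each tuple pairs a vocabulary index with its count.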

Try this

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from gensim.sklearn_api import TfIdfTransformer

twenty_train = fetch_20newsgroups(subset='train', shuffle=True)

count_vect = CountVectorizer(max_features=2000)
X_train_counts = count_vect.fit_transform(twenty_train.data)

model = TfIdfTransformer(smartirs='atn')
# Each row `a` is a 1 x n_features sparse matrix: a.indices holds the term ids
# and a.data the matching counts, so the pairs are (term_id, count).
tfidf_aug = model.fit_transform([[(int(i), int(j)) for i, j in zip(a.indices, a.data)] for a in X_train_counts])
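Iterating over a sparse matrix row by row can be slow on a corpus of this size. As a rough sketch, the same conversion can be done straight from the three CSR arrays. The helper name `csr_to_bow` is hypothetical, and the arrays below are hand-written stand-ins for `X_train_counts.data`, `X_train_counts.indices` and `X_train_counts.indptr`.

```python
def csr_to_bow(data, indices, indptr):
    """Convert CSR arrays into a list of documents of (term_id, count) pairs.

    Row r of the matrix occupies positions indptr[r]:indptr[r + 1]
    in both `indices` (column ids) and `data` (stored values).
    """
    corpus = []
    for start, end in zip(indptr[:-1], indptr[1:]):
        corpus.append([(int(term_id), int(count))
                       for term_id, count in zip(indices[start:end], data[start:end])])
    return corpus

# Stand-in CSR arrays for the matrix [[2, 0, 1], [0, 3, 0], [0, 0, 0]].
data = [2, 1, 3]
indices = [0, 2, 1]
indptr = [0, 2, 3, 3]

corpus = csr_to_bow(data, indices, indptr)
```

With the real matrix, the call would be `csr_to_bow(X_train_counts.data, X_train_counts.indices, X_train_counts.indptr)`. gensim also ships `gensim.matutils.Sparse2Corpus`, which should do the same job when called with `documents_columns=False`, since CountVectorizer puts one document per row.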

Upvotes: 1
