Reputation: 960
I am trying to use the example given in this article https://towardsdatascience.com/machine-learning-nlp-text-classification-using-scikit-learn-python-and-nltk-c52b92a7c73a except that, instead of the 20newsgroups data set the tutorial uses, I am using my own data. It consists of text files in /home/pi/train/, where each subdirectory under train is a label, for example /home/pi/train/FOOTBALL/ and /home/pi/train/BASKETBALL/. I test one document at a time by putting it in either /home/pi/test/FOOTBALL/ or /home/pi/test/BASKETBALL/ and running the program.
# -*- coding: utf-8 -*-
import sklearn
from pprint import pprint
from sklearn.datasets import load_files
docs_to_train = sklearn.datasets.load_files("/home/pi/train/", description=None, categories=None, load_content=True, shuffle=True, encoding=None, decode_error='strict', random_state=0)
pprint(list(docs_to_train.target_names))
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(docs_to_train.data)
X_train_counts.shape
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])
text_clf = text_clf.fit(docs_to_train.data, docs_to_train.target)
import numpy as np
docs_to_test = sklearn.datasets.load_files("/home/pi/test/", description=None, categories=None, load_content=True, shuffle=True, encoding=None, decode_error='strict', random_state=0)
predicted = text_clf.predict(docs_to_test.data)
np.mean(predicted == docs_to_test.target)
pprint(np.mean(predicted == docs_to_test.target))
If I put a football text document in the /home/pi/test/FOOTBALL/ folder and run the program I get:
['FOOTBALL', 'BASKETBALL']
1.0
If I move the same document about football into the /home/pi/test/BASKETBALL/ folder and run the program I get:
['FOOTBALL', 'BASKETBALL']
0.0
Is this how np.mean is supposed to work? Does anyone know what it is trying to tell me?
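For reference, the expression np.mean(predicted == docs_to_test.target) can be tried in isolation. The comparison yields a boolean array, and np.mean treats True as 1 and False as 0, so the result is the fraction of predictions that match the labels. A minimal sketch with made-up label indices (hypothetical values, not the data from the folders above):

```python
import numpy as np

# Hypothetical label indices; 0 and 1 stand in for the two classes.
predicted = np.array([1, 0, 1, 1])
target = np.array([1, 0, 0, 1])

matches = predicted == target   # element-wise comparison, a boolean array
accuracy = np.mean(matches)     # True counts as 1, False as 0

# With a single test document there is only one comparison,
# so the mean can only be 1.0 (match) or 0.0 (mismatch).
single_hit = np.mean(np.array([1]) == np.array([1]))
single_miss = np.mean(np.array([1]) == np.array([0]))

print(accuracy, single_hit, single_miss)
```

Under this reading, a result of 0.0 with one test file would mean the predicted label differs from the label implied by the folder the file was placed in.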
Upvotes: 2
Views: 263
Reputation: 2359
Having a read through the docs on sklearn's load_files, maybe the problem is in the call X_train_counts = count_vect.fit_transform(docs_to_train.data). You may have to explore the structure of the docs_to_train.data object to assess how you access the underlying module data. Unfortunately, the docs aren't all that helpful in terms of data's structure:
Dictionary-like object, the interesting attributes are: either data, the raw text data to learn, or ‘filenames’, the files holding it, ‘target’, the classification labels (integer index), ‘target_names’, the meaning of the labels, and ‘DESCR’, the full description of the dataset.
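One way to explore that structure is to build a tiny throwaway directory tree and look at what load_files returns. This is only a sketch: the temporary folder, the file name doc.txt, and the sample sentences are made up for illustration, and encoding="utf-8" is passed here so the contents come back as text rather than bytes.

```python
import os
import tempfile

from sklearn.datasets import load_files

# Hypothetical sample documents, one per label folder.
samples = [("BASKETBALL", "dunk rebound hoop"),
           ("FOOTBALL", "touchdown quarterback")]

with tempfile.TemporaryDirectory() as root:
    for label, text in samples:
        os.makedirs(os.path.join(root, label))
        with open(os.path.join(root, label, "doc.txt"), "w", encoding="utf-8") as fh:
            fh.write(text)
    # load_content=True is the default, so the file contents are read here.
    bunch = load_files(root, encoding="utf-8", shuffle=False)

print(type(bunch.data), type(bunch.data[0]))  # container type and element type
print(bunch.target_names)                     # label names taken from folder names
print(bunch.target)                           # integer index into target_names per document
```

Printing the types this way shows whether data is a flat sequence of documents or something nested.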
It may also be the case that CountVectorizer() is expecting a single filepath or txt object, and not a data holder filled with many sub-data types.
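This guess can be checked directly by handing CountVectorizer a plain list of strings and seeing whether it accepts it. A minimal sketch, assuming default CountVectorizer settings and two made-up sentences (not the poster's documents):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents standing in for the contents of docs_to_train.data.
docs = ["the quarterback threw a touchdown",
        "the center blocked a dunk"]

count_vect = CountVectorizer()
counts = count_vect.fit_transform(docs)  # sparse matrix, one row per document

print(counts.shape)                      # (number of documents, vocabulary size)
print(sorted(count_vect.vocabulary_))    # the tokens that were learned
```

If this runs without error, a list of documents is an acceptable input, and the vectorizing step is less likely to be the source of the 1.0 / 0.0 result.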
Upvotes: 1