NedStarkOfWinterfell

Reputation: 5153

scikit-learn add training data

I was looking at the training data available in sklearn here. As per the documentation, it contains 20 classes of documents based on a newsgroup collection. It does a fairly good job of classifying documents belonging to those categories. However, I need to add more articles for new categories, such as cricket, football, nuclear physics, etc.

I have a set of documents ready for each class, like sports -> cricket, cooking -> French, etc. How do I add those documents and classes to sklearn so that the interface that now returns 20 classes returns those 20 plus the new ones as well? If there is some training I need to do, either through SVM or Naive Bayes, where do I do it before adding the data to the dataset?

Upvotes: 0

Views: 2699

Answers (1)

tttthomasssss

Reputation: 5971

Supposing your additional data has the following directory structure (if not, then creating it should be your first step, because it will make your life a lot easier: you can then use the sklearn API to fetch the data, see here):

additional_data
      |
      |-> sports.cricket
                |
                |-> file1.txt
                |-> file2.txt
                |-> ...
      |
      |-> cooking.french
                |
                |-> file1.txt
                |-> ...
       ...
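
If your documents are not laid out like that yet, a minimal sketch along the following lines can copy them into place (the source folders and the `category_dirs` mapping below are just illustrative placeholders, adjust them to wherever your files actually live):

import os
import shutil

# Hypothetical mapping from your existing document folders to category names
category_dirs = {
    '/path/to/my_cricket_articles': 'sports.cricket',
    '/path/to/my_french_recipes': 'cooking.french',
}

additional_data_root = '/path/to/additional_data'

for source_dir, category in category_dirs.items():
    target_dir = os.path.join(additional_data_root, category)
    os.makedirs(target_dir, exist_ok=True)

    # Copy every document into its category folder
    for filename in os.listdir(source_dir):
        shutil.copy(os.path.join(source_dir, filename), target_dir)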

Moving to Python, load up both datasets (supposing your additional data is in the above format and rooted at /path/to/additional_data):

import os

import joblib
import numpy as np

from sklearn.datasets import fetch_20newsgroups
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Note if you have a pre-defined training/testing split in your additional data, you would merge them with the corresponding 'train' and 'test' subsets of 20news
news_data = fetch_20newsgroups(subset='all')
additional_data = load_files(container_path='/path/to/additional_data', encoding='utf-8')
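
# A hedged sketch of that variant, assuming hypothetical 'train' and 'test' sub directories under your additional data:
#
#   news_train = fetch_20newsgroups(subset='train')
#   news_test = fetch_20newsgroups(subset='test')
#   additional_train = load_files(container_path='/path/to/additional_data/train', encoding='utf-8')
#   additional_test = load_files(container_path='/path/to/additional_data/test', encoding='utf-8')
#
# You would then write each subset out to its own sub directory of the merged folder (as done below) so the split is preserved on disk.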

# Both data objects are of type `Bunch` and therefore can be relatively straightforwardly merged

# Merge the two data files
'''
The Bunch object contains the following attributes: `dict_keys(['target_names', 'description', 'DESCR', 'target', 'data', 'filenames'])`
The interesting ones for our purposes are 'data' and 'filenames'
'''
all_filenames = np.concatenate((news_data.filenames, additional_data.filenames)) # filenames is a numpy array
all_data = news_data.data + additional_data.data # data is a standard python list

merged_data_path = '/path/to/merged_data'

'''
The 20newsgroups data has filenames like '/path/to/scikit_learn_data/20news_home/20news-bydate-test/rec.sport.hockey/54367'.
So depending on whether you want to keep the sub directory structure of the train/test split or not,
you will need either the last 2 or 3 parts of the path.
'''
for content, f in zip(all_data, all_filenames):
    # extract sub path
    sub_path, filename = f.split(os.sep)[-2:]
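    # (Sketch) To also keep the train/test sub directory of the 20news paths, you could instead do:
    #   parts = f.split(os.sep)
    #   sub_path, filename = os.path.join(*parts[-3:-1]), parts[-1]
    # (the additional data paths would then need their own handling)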

    # Create output directory if not exists
    p = os.path.join(merged_data_path, sub_path)
    if not os.path.exists(p):
        os.makedirs(p)

    # Write data to file
    with open(os.path.join(p, filename), 'w', encoding='utf-8') as out_file:
        out_file.write(content)

# Now that everything is stored at `merged_data_path`, we can use `load_files` to fetch the dataset again, which now includes everything from 20newsgroups and your additional data
all_data = load_files(container_path=merged_data_path, encoding='utf-8')

'''
all_data is yet another `Bunch` object:
    * `data` contains the data
    * `target_names` contains the label names
    * `target` contains the labels in numeric format
    * `filenames` contains the paths of each individual document

thus, running a classifier over the data is straightforward
'''
vec = CountVectorizer()
X = vec.fit_transform(all_data.data)

# We want to create a train/test split for learning and evaluating a classifier (supposing we haven't created a pre-defined train/test split encoded in the directory structure)
X_train, X_test, y_train, y_test = train_test_split(X, all_data.target, test_size=0.2)

# Create & fit the MNB model
mnb = MultinomialNB()
mnb.fit(X_train, y_train)

# Evaluate Accuracy
y_predicted = mnb.predict(X_test)

print('Accuracy: {}'.format(accuracy_score(y_test, y_predicted)))

# Alternatively, the vectorisation and learning can be packaged into a pipeline and serialised for later use
pipeline = Pipeline([('vec', CountVectorizer()), ('mnb', MultinomialNB())])

# Run the vectorizer and train the classifier on all available data
pipeline.fit(all_data.data, all_data.target)

# Serialise the classifier to disk
joblib.dump(pipeline, '/path/to/model_zoo/mnb_pipeline.joblib')

# If you get some more data later on, you can deserialise the model and run the new documents through the pipeline
p = joblib.load('/path/to/model_zoo/mnb_pipeline.joblib')

docs_new = ['God is love', 'OpenGL on the GPU is fast']

y_predicted = p.predict(docs_new)
print('Predicted labels: {}'.format(np.array(all_data.target_names)[y_predicted]))    

Upvotes: 2
