ypwillygope

Reputation: 33

Supervised Machine Learning, producing a trained estimator

I have an assignment in which I am supposed to use scikit, numpy and pylab to do the following:

"All of the following should use data from the training_data.csv file provided. training_data gives you a labeled set of integer pairs, representing the scores of two sports teams, with the labels giving the sport.

Write the following functions:

plot_scores() should draw a scatterplot of the data.

predict(dataset) should produce a trained Estimator to guess the sport that resulted in a given score (from a dataset we've withheld, which will be input as a 1000 x 2 np array). You can use any algorithm from scikit.

An optional additional function called "preprocess" will process dataset before it is passed to predict. "

This is what I have done so far:

import numpy as np
import scipy as sp
import pylab as pl
from random import shuffle

def plot_scores():
    # parse the CSV: each row is "score1,score2,label"
    lst = []
    with open('training_data.csv') as k:
        for triple in k:
            temp = triple.split(',')
            # temp[2][:1] keeps only the first character of the label,
            # which also drops the trailing newline
            lst.append([int(temp[0]), int(temp[1]), int(temp[2][:1])])
    array = np.array(lst)
    pl.scatter(array[:, 0], array[:, 1])
    pl.show()

def preprocess(dataset):
    # note: the dataset argument is not used here yet
    # re-read and parse the training file, then shuffle the rows
    lst = []
    with open('training_data.csv') as k:
        for triple in k:
            temp = triple.split(',')
            lst.append([int(temp[0]), int(temp[1]), int(temp[2][:1])])
    shuffle(lst)
    return lst

In preprocess, I shuffled the data because I am supposed to use some of it to train on and some of it to test on, and the original data was not at all random. My question is: how am I supposed to "produce a trained estimator" in predict(dataset)? Is this supposed to be a function that returns another function? And which algorithm would be ideal for classifying a dataset that looks like this?

[scatter plot of the training data]
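
(Side note: scikit can do this shuffle-and-split in one call with train_test_split; a minimal sketch, assuming lst is the parsed [score1, score2, label] list from above:)

import numpy as np
from sklearn.model_selection import train_test_split

data = np.array(lst)
X = data[:, :2]  # the two team scores
y = data[:, 2]   # the sport label
# shuffles the rows before splitting into halves
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)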

Upvotes: 0

Views: 89

Answers (3)

Anggi Permana Harianja

Reputation: 195

I think what you are looking for is the clf.fit() function, rather than a function that produces another function.
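
A minimal sketch of that pattern, with toy data standing in for the parsed scores (KNeighborsClassifier is just one possible choice):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[3, 2], [21, 14], [1, 0], [28, 7]])  # toy score pairs
y = np.array([0, 1, 0, 1])                         # toy sport labels

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X, y)  # fit() trains the estimator in place; clf is now the "trained estimator"
print(clf.predict([[24, 10]]))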

Upvotes: 0

Fomalhaut

Reputation: 9727

I would recommend taking a look at this structure:

from random import shuffle
import matplotlib.pyplot as plt
# import a classifier you need


def get_data():
    # open your file and parse data to prepare X as a set of input vectors and Y as a set of targets
    return X, Y


def split_data(X, Y):
    size = len(X)
    indices = list(range(size))  # list() so shuffle() can reorder it in place (needed on Python 3)
    shuffle(indices)
    train_indices = indices[:size // 2]  # integer division keeps slice indices as ints
    test_indices = indices[size // 2:]
    X_train = [X[i] for i in train_indices]
    Y_train = [Y[i] for i in train_indices]
    X_test = [X[i] for i in test_indices]
    Y_test = [Y[i] for i in test_indices]
    return X_train, Y_train, X_test, Y_test


def plot_scatter(Y1, Y2):
    plt.figure()
    plt.scatter(Y1, Y2, c='b')  # scatter() takes a color kwarg; 'bo'-style format strings belong to plt.plot()
    plt.show()


# get data
X, Y = get_data()

# split data
X_train, Y_train, X_test, Y_test = split_data(X, Y)

# create a classifier as an object
classifier = YourImportedClassifier()

# train the classifier, after that the classifier is the trained estimator you need
classifier.fit(X_train, Y_train) # scikit estimators are trained with .fit(); other libraries may call this .train()

# make a prediction
Y_prediction = classifier.predict(X_test)

# plot the scatter
plot_scatter(Y_prediction, Y_test)

Upvotes: 0

aleju

Reputation: 2386

The task likely wants you to train a standard scikit classifier model and return it, i.e. something like

from sklearn.svm import SVC
def predict(dataset):
    X = ... # features, extract from dataset
    y = ... # labels, extract from dataset
    clf = SVC() # create classifier
    clf.fit(X, y) # train
    return clf

Though judging from the name of the function (predict), you should check whether it really wants you to return a trained classifier or to return predictions for the given dataset argument, as that would be more typical.
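
In the latter reading, predict would train on the labeled CSV data and return labels for the withheld array; a sketch, assuming X_train and y_train have already been parsed from training_data.csv:

from sklearn.svm import SVC

def predict(dataset):
    clf = SVC()
    clf.fit(X_train, y_train)    # train on the labeled CSV data (assumed parsed elsewhere)
    return clf.predict(dataset)  # return predicted labels for the withheld 1000 x 2 array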

As a classifier you can use basically any one you like. Your plot looks like your dataset is linearly separable (there are no colors for the classes, but I assume that the blobs are the two classes). On linearly separable data hardly anything will fail. Try SVMs, logistic regression, random forests, naive Bayes, ... For extra fun you can try to plot the decision boundaries, see here (which also contains an overview of the available classifiers).
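
A quick way to try several of them is to loop over instances and compare held-out accuracy; a minimal sketch, assuming X_train, y_train, X_test, y_test have already been prepared:

from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB

for clf in (SVC(), LogisticRegression(), RandomForestClassifier(), GaussianNB()):
    clf.fit(X_train, y_train)
    # score() returns mean accuracy on the held-out data
    print(type(clf).__name__, clf.score(X_test, y_test))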

Upvotes: 1
