Scaramouche

Reputation: 3267

use python's sklearn module with custom dataset

I've never used Python before, and I find myself in dire need of using the sklearn module in my Node.js project for machine learning purposes.

I have spent all day trying to understand the code examples in said module, and now that I more or less understand how they work, I don't know how to use my own data set.

Each of the built-in data sets has its own function (load_iris, load_wine, load_breast_cancer, etc.), and they all load data from a .csv and an .rst file. I can't find a function that will allow me to load my own data set. (There's a load_data function, but it seems to be for internal use by the three I mentioned, because I can't import it.)

How could I do that? What's the proper way to use sklearn with any other data set? Does it always have to be a .csv file? Could it be programmatically provided data (array, object, etc)?

In case it's important: all those built-in data sets have numeric features, my data set has both numeric and string features to be used in the decision tree.

Thanks

Upvotes: 1

Views: 630

Answers (1)

seralouk

Reputation: 33147

You can load your data however you like (from a file or from arrays built in code) and then pass it to sklearn models.

If you have a .csv file, pandas would be the best option.

import pandas as pd

mydataset = pd.read_csv("dataset.csv")

X = mydataset.values[:, 0:10]  # let's assume that the first 10 columns (indices 0-9) are the features/variables
y = mydataset.values[:, 10]  # let's assume that the 11th column (index 10) has the target values/classes
...
sklearn_model.fit(X,y)

Similarly, you can load .txt or .xls files.
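As a rough sketch of the .txt case, assuming the file is tab-separated with a header row (the column names and values here are made up, and an in-memory string stands in for the file so the snippet runs on its own):

```python
import io

import pandas as pd

# Hypothetical tab-separated content standing in for a .txt file on disk.
txt_content = "f1\tf2\ttarget\n1.0\t2.0\t0\n3.0\t4.0\t1\n5.0\t6.0\t0\n"

# For a real file, pass its path instead of the StringIO object.
mydataset = pd.read_csv(io.StringIO(txt_content), sep="\t")

X = mydataset.values[:, 0:2]  # first two columns are the features
y = mydataset.values[:, 2]    # third column is the target

# For Excel files, pd.read_excel("dataset.xls") plays the same role;
# it needs an optional Excel reader package installed alongside pandas.
```

After that, X and y can be passed to fit exactly as in the .csv case.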

The important thing in order to use sklearn models is this:

  • X should always be a 2D array with shape [n_samples, n_variables]
  • y should be the target variable, with one entry per sample.
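To tie this back to the question, here is a minimal sketch with data built directly in code and one string feature. The column names and values are invented for illustration, and OrdinalEncoder is only one possible way to turn strings into numbers (it imposes an arbitrary order on the categories, so OneHotEncoder may suit some data better):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical in-memory data: two numeric features, one string feature, one target.
data = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29],
    "income": [30.0, 45.5, 80.0, 62.0, 52.5, 33.0],
    "city": ["paris", "rome", "paris", "berlin", "rome", "berlin"],
    "label": ["no", "no", "yes", "yes", "yes", "no"],
})

# sklearn trees expect numeric features, so encode the string column first.
encoder = OrdinalEncoder()
city_codes = encoder.fit_transform(data[["city"]])  # 2D array, shape (6, 1)

# Stack numeric and encoded columns into X with shape [n_samples, n_variables].
X = np.hstack([data[["age", "income"]].to_numpy(), city_codes])
y = data["label"].to_numpy()  # string class labels are fine for classifiers

model = DecisionTreeClassifier(random_state=0)
model.fit(X, y)
```

New samples have to go through the same encoder (encoder.transform) before calling model.predict.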

Upvotes: 2
