SciKit-learn (python)--Create my dataset

Question

I have installed scikit-learn and I don’t know how to use it. I have some data that looks like this :

{"Tiempo": 2.1,  "Brazos": "der", "Puntuacion ": 112, "Nombre": "Alguien1"},
{"Tiempo": 4.1, "Brazos": "izq", "Puntuacion ": 11, "Nombre": "Alguien2"},
{"Tiempo": 3.211,  "Brazos": "ambos","Puntuacion ": 1442, "Nombre": "Alguien3"}

And I would like to use some classifiers (like SVM) on them. For what I have seen in the examples, I need to create a dataset. In the examples they always use some predetermined datasets as “iris”. In my case, I suppose that I would need to create my own using my data. In order to do it, I searched and I found that I should use the next functions to obtain the “features” of my dataset:

measurements = [
    {'city': 'Dubai', 'temperature': 33.},
    {'city': 'London', 'temperature': 12.},
    {'city': 'San Fransisco', 'temperature': 18.},
]

from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer()

vec.fit_transform(measurements).toarray()
array([[  1.,   0.,   0.,  33.],
       [  0.,   1.,   0.,  12.],
       [  0.,   0.,   1.,  18.]])

>>> vec.get_feature_names()
['city=Dubai', 'city=London', 'city=San Fransisco', 'temperature']

And in my case, after y use that functions with my data I obtain this: enter image description here

Once I have this I suppose that I need to obtain my “samples”, however, I don’t know how to do. Could you help me please? And can you tell me if my suppositions are correct?

Jianxun Li · Accepted Answer

You are on the right track. Use your data as an example.

from sklearn.feature_extraction import DictVectorizer

# your data
data = [{"Tiempo": 2.1,  "Brazos": "der", "Puntuacion ": 112, "Nombre": "Alguien1"}, {"Tiempo": 4.1, "Brazos": "izq", "Puntuacion ": 11, "Nombre": "Alguien2"}, {"Tiempo": 3.211,  "Brazos": "ambos","Puntuacion ": 1442, "Nombre": "Alguien3"}]

# make dummy for categorical variables
transformer = DictVectorizer()
transformer.fit_transform(data).toarray()

Out[168]: 
array([[  0.0000e+00,   1.0000e+00,   0.0000e+00,   1.0000e+00,   0.0000e+00,
          0.0000e+00,   1.1200e+02,   2.1000e+00],
       [  0.0000e+00,   0.0000e+00,   1.0000e+00,   0.0000e+00,   1.0000e+00,
          0.0000e+00,   1.1000e+01,   4.1000e+00],
       [  1.0000e+00,   0.0000e+00,   0.0000e+00,   0.0000e+00,   0.0000e+00,
          1.0000e+00,   1.4420e+03,   3.2110e+00]])

transformer.get_feature_names()

Out[170]: 
['Brazos=ambos',
 'Brazos=der',
 'Brazos=izq',
 'Nombre=Alguien1',
 'Nombre=Alguien2',
 'Nombre=Alguien3',
 'Puntuacion ',
 'Tiempo']

So you see, each record in Out[168] has 8 columns, the first 3 are categorical dummy for Brazos (look at the feature names in Out[170]), the next three are dummy for Nombre, the last two are continue numeric values Puntuacion and Tiempo (which doesn't require any conversion and stay as what it was).

# to continue to fit the model, transform your raw JSON data to numeric value
X = transformer.fit_transform(data)
# import your estimator
from sklearn.naive_bayes import BernoulliNB
estimator = BernoulliNB()
# then start to fit and predict
# NOTE! require your y labels
estimator.fit(X, y)

SciKit-learn (python)--Create my dataset

Answers (1)

Related Questions