Reputation: 1627
I have installed scikit-learn but I don't know how to use it. I have some data that looks like this:
{"Tiempo": 2.1, "Brazos": "der", "Puntuacion ": 112, "Nombre": "Alguien1"},
{"Tiempo": 4.1, "Brazos": "izq", "Puntuacion ": 11, "Nombre": "Alguien2"},
{"Tiempo": 3.211, "Brazos": "ambos","Puntuacion ": 1442, "Nombre": "Alguien3"}
And I would like to use some classifiers (like SVM) on it. From what I have seen in the examples, I need to create a dataset. The examples always use predetermined datasets such as "iris". In my case, I suppose I need to create my own dataset from my data. Searching for how to do this, I found that I should use the following functions to obtain the "features" of my dataset:
from sklearn.feature_extraction import DictVectorizer

measurements = [
    {'city': 'Dubai', 'temperature': 33.},
    {'city': 'London', 'temperature': 12.},
    {'city': 'San Fransisco', 'temperature': 18.},
]

vec = DictVectorizer()
vec.fit_transform(measurements).toarray()
array([[  1.,   0.,   0.,  33.],
       [  0.,   1.,   0.,  12.],
       [  0.,   0.,   1.,  18.]])
>>> vec.get_feature_names()
['city=Dubai', 'city=London', 'city=San Fransisco', 'temperature']
And in my case, after I use those functions with my data, I obtain a similar array. Once I have this, I suppose that I need to obtain my "samples"; however, I don't know how to do it. Could you help me, please? And can you tell me if my suppositions are correct?
Upvotes: 2
Views: 975
Reputation: 24752
You are on the right track. Here is the same approach applied to your own data.
from sklearn.feature_extraction import DictVectorizer
# your data
data = [
    {"Tiempo": 2.1, "Brazos": "der", "Puntuacion ": 112, "Nombre": "Alguien1"},
    {"Tiempo": 4.1, "Brazos": "izq", "Puntuacion ": 11, "Nombre": "Alguien2"},
    {"Tiempo": 3.211, "Brazos": "ambos", "Puntuacion ": 1442, "Nombre": "Alguien3"},
]

# one-hot encode the categorical variables
transformer = DictVectorizer()
transformer.fit_transform(data).toarray()
Out[168]:
array([[0.0000e+00, 1.0000e+00, 0.0000e+00, 1.0000e+00, 0.0000e+00,
        0.0000e+00, 1.1200e+02, 2.1000e+00],
       [0.0000e+00, 0.0000e+00, 1.0000e+00, 0.0000e+00, 1.0000e+00,
        0.0000e+00, 1.1000e+01, 4.1000e+00],
       [1.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
        1.0000e+00, 1.4420e+03, 3.2110e+00]])
transformer.get_feature_names()
Out[170]:
['Brazos=ambos',
'Brazos=der',
'Brazos=izq',
'Nombre=Alguien1',
'Nombre=Alguien2',
'Nombre=Alguien3',
'Puntuacion ',
'Tiempo']
So you see, each record in Out[168] has 8 columns: the first 3 are one-hot dummies for Brazos (look at the feature names in Out[170]), the next three are dummies for Nombre, and the last two are the continuous numeric values Puntuacion and Tiempo (which don't require any conversion and stay as they were).
# to fit a model, transform your raw JSON data to numeric values
X = transformer.fit_transform(data)

# import your estimator
from sklearn.naive_bayes import BernoulliNB
estimator = BernoulliNB()

# then fit and predict
# NOTE: this requires your y labels (the target class for each record)
estimator.fit(X, y)
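Putting it all together, a minimal runnable sketch looks like this. The y labels here are invented purely for illustration; replace them with the real target classes you want to predict for each record:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import BernoulliNB

data = [
    {"Tiempo": 2.1, "Brazos": "der", "Puntuacion ": 112, "Nombre": "Alguien1"},
    {"Tiempo": 4.1, "Brazos": "izq", "Puntuacion ": 11, "Nombre": "Alguien2"},
    {"Tiempo": 3.211, "Brazos": "ambos", "Puntuacion ": 1442, "Nombre": "Alguien3"},
]

# hypothetical labels, one class per record, invented for this example
y = [0, 1, 0]

transformer = DictVectorizer()
X = transformer.fit_transform(data)  # sparse matrix, shape (3, 8)

estimator = BernoulliNB()
estimator.fit(X, y)

# predict the class of a new, unseen record;
# use transform (not fit_transform) so the columns line up with training
new_record = [{"Tiempo": 2.5, "Brazos": "der", "Puntuacion ": 100, "Nombre": "Alguien1"}]
print(estimator.predict(transformer.transform(new_record)))
```

The key point is to call transform (not fit_transform) on new data, so the same column layout learned during training is reused.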
Upvotes: 2