Reputation: 163
I am quite new to scikit-learn and I am trying to use this package to make predictions on income data. This may be a duplicate question, as I saw another post on this, but I am looking for a simple example to understand what scikit-learn estimators expect.
The data I have has the following structure, where many of the features are categorical (e.g. workclass, education, ...):
age: continuous.
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
fnlwgt: continuous.
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
education-num: continuous.
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
sex: Female, Male.
capital-gain: continuous.
capital-loss: continuous.
hours-per-week: continuous.
native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
Example records:
38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States <=50K
53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States <=50K
30 State-gov 141297 Bachelors 13 Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander Male 0 0 40 India >50K
I am having a hard time handling the categorical features, as most of the models in scikit-learn expect all features to be numeric. They do provide some classes to transform/encode such features (like OneHotEncoder and DictVectorizer), but I cannot find a way to use these on my data. I know there are quite a number of steps involved before I fully encode them to numbers, but I am just wondering if anybody knows a simpler and more efficient way (since there are many such features) that can be understood with an example. I vaguely know DictVectorizer is the way to go, but I need help on how to proceed.
Upvotes: 3
Views: 1624
Reputation: 363838
Here's some example code using DictVectorizer. First, let's set up some data in the Python shell; I leave reading from a file up to you.
>>> features = ["age", "workclass", "fnlwgt", "education", "education-num", "marital-status", "occupation",
... "relationship", "race", "sex", "capital-gain", "capital-loss", "hours-per-week", "native-country"]
>>> input_text = """38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States <=50K
... 53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States <=50K
... 30 State-gov 141297 Bachelors 13 Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander Male 0 0 40 India >50K
... """
Now, parse these:
>>> samples = []   # list of feature dicts, one per record
>>> y = []         # target labels
>>> for ln in input_text.splitlines():
...     values = ln.split()
...     y.append(values[-1])                   # the last field is the income label
...     d = dict(zip(features, values[:-1]))   # the remaining fields map to feature names
...     samples.append(d)
What have we got now? Let's check:
>>> from pprint import pprint
>>> pprint(samples[0])
{'age': '38',
'capital-gain': '0',
'capital-loss': '0',
'education': 'HS-grad',
'education-num': '9',
'fnlwgt': '215646',
'hours-per-week': '40',
'marital-status': 'Divorced',
'native-country': 'United-States',
'occupation': 'Handlers-cleaners',
'race': 'White',
'relationship': 'Not-in-family',
'sex': 'Male',
'workclass': 'Private'}
>>> print(y)
['<=50K', '<=50K', '>50K']
These samples are ready for DictVectorizer, so pass them:
>>> from sklearn.feature_extraction import DictVectorizer
>>> dv = DictVectorizer()
>>> X = dv.fit_transform(samples)
>>> X
<3x29 sparse matrix of type '<type 'numpy.float64'>'
with 42 stored elements in Compressed Sparse Row format>
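If you want to check what the 29 columns mean, ask the vectorizer for its feature names (get_feature_names() in older scikit-learn versions, get_feature_names_out() in newer ones). Note that values left as strings, such as '38' for age, are one-hot encoded too; converting the continuous fields to float before building the dicts would keep them as single numeric columns instead. A minimal sketch:
>>> names = dv.get_feature_names()   # or dv.get_feature_names_out() on recent versions
>>> names[:3]
['age=30', 'age=38', 'age=53']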
Et voilà, you have X and y that can be passed to an estimator, provided it supports sparse matrices. (Otherwise, pass sparse=False to the DictVectorizer constructor.)
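For example, here is a minimal sketch of fitting a classifier on this (tiny) training set; LogisticRegression is my own choice here, not something prescribed by the question, and it accepts sparse input directly:
>>> from sklearn.linear_model import LogisticRegression
>>> clf = LogisticRegression()     # accepts sparse X; string labels in y are fine
>>> clf = clf.fit(X, y)            # fit returns the estimator, so assign to avoid the echoed repr
>>> predictions = clf.predict(X)   # predicting back on the training samples, just as a smoke test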
Test samples can similarly be passed to DictVectorizer.transform; if there are feature/value combinations in the test set that do not occur in the training set, they will simply be ignored (because the learned model cannot do anything sensible with them anyway).
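For instance, a quick sketch with one made-up (hypothetical) test record; note the use of transform, not fit_transform, so the columns stay aligned with the training matrix:
>>> test_samples = [{"age": "40", "workclass": "Private", "sex": "Female",
...                  "native-country": "Holand-Netherlands"}]   # hypothetical partial record
>>> X_test = dv.transform(test_samples)   # unseen values like sex=Female are silently dropped
>>> X_test.shape                          # same 29 columns as the training matrix
(1, 29)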
Upvotes: 6