Reputation: 837
For a machine learning problem, every sample has a location feature (a US state). The whole feature vector looks like this:
array(['oklahoma', 'florida', 'idaho', ..., 'pennsylvania', 'alabama',
'washington'], dtype=object)
I cannot feed this directly into a sklearn algorithm, so I have to convert it into numerical features somehow, but I don't know how. What are the best ways to convert these string features? Would ASCII conversion work?
Edit: I want every state to have its own unique numerical value.
Upvotes: 2
Views: 1453
Reputation: 5289
Edit: a simple mapping to numbers may be faster and avoids collisions:
from numpy import array

features = array(['oklahoma', 'florida', 'idaho', 'pennsylvania', 'alabama', 'washington'], dtype=object)
# number each item by its position in the array
numbers = range(0, len(features))
num2string = dict(zip(numbers, features))
string2num = dict(zip(features, numbers))
# read the result
for i in num2string:
    print("%i => '%s'" % (i, num2string[i]))
print("usage test:")
print(string2num['oklahoma'])
print(num2string[string2num['oklahoma']])
You will get a simple sequence of numbers for every item in your array:
0 => 'oklahoma'
1 => 'florida'
2 => 'idaho'
Advantage: simplicity and speed.
Disadvantage: the same string gets a different number if its position in the array changes, which does not happen when you hash the strings.
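The position dependence can be reduced by numbering the sorted unique values instead of the raw array positions. This also copes with a real feature vector, where the same state appears many times. A minimal sketch in plain Python, assuming a small made-up sample with repeated states (the sample data and the name `codes` are illustrative, not from the question):

```python
# Made-up sample with repeated states, as in a real feature vector.
features = ['oklahoma', 'florida', 'idaho', 'florida', 'oklahoma']

# Sorting the distinct values makes the numbering independent of row order.
states = sorted(set(features))

string2num = {state: i for i, state in enumerate(states)}
num2string = dict(enumerate(states))

# Encode the whole feature vector.
codes = [string2num[s] for s in features]
print(codes)  # [2, 0, 1, 0, 2]
```

The numbers still change if a new state shows up later and sorts before existing ones, so the mapping should be built once and reused.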
Usage of hashing
You can hash the strings using a well-chosen hash algorithm. You have to be careful about the number of collisions for your hash function: if two different strings have the same hash, they end up with the same number in your input. In this example, the md5 hash function is used:
import hashlib
from numpy import array

def string_to_num(s):
    # md5 needs bytes, so encode the string first
    return int(hashlib.md5(s.encode('utf-8')).hexdigest(), 16)

features = array(['oklahoma', 'florida', 'idaho', 'pennsylvania', 'alabama', 'washington'], dtype=object)
# hash those strings
features_string_for_number = {}
for i in features:
    hash_number = string_to_num(i)
    features_string_for_number[hash_number] = i
# read the result
for i in features_string_for_number:
    print("%i => '%s'" % (i, features_string_for_number[i]))
print("usage test:")
print(string_to_num('oklahoma'))
print(features_string_for_number[string_to_num('oklahoma')])
The hashing part is taken from here.
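One practical caveat: the full md5 value is a 128-bit integer, which does not fit in fixed-width 64-bit integer types. A minimal Python 3 sketch, assuming it is acceptable to keep only the first few bytes of the digest. The `n_bytes` parameter and the explicit collision check are hypothetical additions, not part of the code above:

```python
import hashlib

def string_to_num(s, n_bytes=4):
    """Map a string to a non-negative integer below 256 ** n_bytes."""
    digest = hashlib.md5(s.encode('utf-8')).digest()
    # Keeping fewer bytes makes collisions more likely, hence the check below.
    return int.from_bytes(digest[:n_bytes], 'big')

states = ['oklahoma', 'florida', 'idaho', 'pennsylvania', 'alabama', 'washington']

num_to_state = {}
collisions = []
for state in states:
    number = string_to_num(state)
    # Record any pair of different strings that map to the same number.
    if number in num_to_state and num_to_state[number] != state:
        collisions.append((num_to_state[number], state))
    num_to_state[number] = state

print("collisions:", collisions)
```

If `collisions` is not empty, a larger `n_bytes` or a plain lookup table is probably the safer choice.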
Upvotes: 3
Reputation: 3635
If you just want to turn each state name into a unique numerical value, then hash(text) would work well.
A more robust hash function may be needed, though, because the built-in hash is not guaranteed to return the same value every time Python is run. In fact, from Python 3.3 on, string hashes are salted differently for each run unless you specifically configure otherwise (for example by setting the PYTHONHASHSEED environment variable). The hashlib module contains various hash algorithms that may suit better.
Upvotes: 3
Reputation: 48317
You can refer to Label preprocessing:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(['oklahoma', 'florida', 'idaho', 'pennsylvania', 'alabama',
'washington'])
le.classes_
# array(['alabama', 'florida', 'idaho', 'oklahoma', 'pennsylvania',
# 'washington'],
# dtype='|S12')
le.transform(["oklahoma"])
# array([3])
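To encode the whole feature vector in one step and map the numbers back to names, `fit_transform` and `inverse_transform` can be used. A short sketch, assuming scikit-learn is installed and using a made-up sample with repeated states:

```python
from sklearn import preprocessing

# Made-up sample with repeated states, as in a real feature vector.
features = ['oklahoma', 'florida', 'idaho', 'florida', 'oklahoma']

le = preprocessing.LabelEncoder()
# Learn the sorted set of labels and encode the vector in a single call.
codes = le.fit_transform(features)

# Map the integer codes back to the original state names.
decoded = le.inverse_transform(codes)
```

The classes are sorted alphabetically, so the codes do not depend on the order of the rows.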
Upvotes: 6