Learner

Reputation: 837

Looking for a way to preprocess string features

For a machine learning problem, every sample has a location feature (a US state). The whole feature vector looks like this:

array(['oklahoma', 'florida', 'idaho', ..., 'pennsylvania', 'alabama',
   'washington'], dtype=object)

I cannot feed this directly into a sklearn algorithm, so I have to convert it into numerical features somehow, but I don't know how. What are the best ways to convert these string features? Would ASCII conversion work?

Edit: I want every state to have its own unique numerical value.

Upvotes: 2

Views: 1453

Answers (3)

nio

Reputation: 5289

Edit: a simple mapping to numbers may be faster and avoids collisions:

from numpy import array

features = array(['oklahoma', 'florida', 'idaho', 'pennsylvania', 'alabama', 'washington'], dtype=object)

# number the items by position and build lookups in both directions
numbers = range(len(features))
num2string = dict(zip(numbers, features))
string2num = dict(zip(features, numbers))

# read the result
for i in num2string:
    print("%i => '%s'" % (i, num2string[i]))

print("usage test:")
print(string2num['oklahoma'])
print(num2string[string2num['oklahoma']])

You will get a simple sequence of numbers for every item in your array:

0 => 'oklahoma'
1 => 'florida'
2 => 'idaho'

Advantage: simplicity and speed.

Disadvantage: you'll get a different number for the same string if its position in the array changes, unlike with hashing the strings.
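One way to reduce the position dependence mentioned above is to number the sorted distinct values instead of the raw array positions. The following is only a sketch: the sample list with repeated states is made up for illustration, and the numbering will still shift if a state that was not seen before shows up later.

```python
# Sample column with repeated values, as a real feature vector would have
# (illustrative data, not from the question).
features = ['oklahoma', 'florida', 'idaho', 'florida', 'alabama', 'oklahoma']

# Sort the distinct values so the numbering depends only on which states
# occur, not on where they first appear in the array.
string2num = {name: i for i, name in enumerate(sorted(set(features)))}
num2string = {i: name for name, i in string2num.items()}

# Encode the whole column with the lookup table.
encoded = [string2num[name] for name in features]
print(encoded)
```

Reordering the samples leaves `string2num` unchanged, because the table is built from the sorted set rather than from positions.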

Usage of hashing

You can hash the strings using a well-chosen hash algorithm. Be careful about the number of collisions for your hash function: if two different strings have the same hash, they end up with the same number in your input. This example uses the MD5 hash function:

import hashlib
from numpy import array


def string_to_num(s):
    # hashlib works on bytes, so encode the string first
    return int(hashlib.md5(s.encode('utf-8')).hexdigest(), 16)

features = array(['oklahoma', 'florida', 'idaho', 'pennsylvania', 'alabama', 'washington'], dtype=object)

# hash those strings and remember which string produced each number
features_string_for_number = {}
for i in features:
    hash_number = string_to_num(i)
    features_string_for_number[hash_number] = i

# read the result
for i in features_string_for_number:
    print("%i => '%s'" % (i, features_string_for_number[i]))

print("usage test:")
print(string_to_num('oklahoma'))
print(features_string_for_number[string_to_num('oklahoma')])

The hashing part is taken from here.
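Since collisions are the main risk with this approach, it may help to detect them while the reverse lookup is built. The sketch below assumes Python 3 and uses a hypothetical helper, `build_reverse_lookup`, which is not part of the original answer; it raises an error if two different strings ever map to the same number.

```python
import hashlib


def string_to_num(s):
    # hashlib expects bytes in Python 3, so encode the string first.
    return int(hashlib.md5(s.encode("utf-8")).hexdigest(), 16)


def build_reverse_lookup(values):
    """Map hash -> string, raising if two different strings share a hash."""
    lookup = {}
    for value in values:
        key = string_to_num(value)
        if key in lookup and lookup[key] != value:
            raise ValueError("collision: %r and %r" % (lookup[key], value))
        lookup[key] = value
    return lookup


# Repeated values are fine; only distinct strings with equal hashes fail.
lookup = build_reverse_lookup(['oklahoma', 'florida', 'idaho', 'florida'])
print(len(lookup))
```

With a 128-bit digest and a few dozen state names a collision is extremely unlikely, so the check mostly serves as a safety net.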

Upvotes: 3

neil

Reputation: 3635

If you just want to turn each state name into a unique numerical value, then hash(text) would work well.

A more stable hash function may be needed, though, because the result of hash() is not guaranteed to be the same every time Python is run. In fact, from Python 3.3 on, string hashes are salted differently for each process unless you set it up otherwise (for example via the PYTHONHASHSEED environment variable). The hashlib module contains various hash algorithms that may suit better.
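As a rough sketch of the hashlib route, the function below derives an integer from a SHA-256 digest, which does not change between Python runs. The name `stable_hash` and the `num_bytes` parameter are assumptions made for this example; truncating to 4 bytes keeps the value small enough to be stored exactly in a float64 feature matrix, at the cost of a slightly higher (still very small) collision chance.

```python
import hashlib


def stable_hash(text, num_bytes=4):
    """Return an integer derived from SHA-256 that is stable across runs."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    # Keep only the first num_bytes bytes and read them as a big-endian int.
    return int.from_bytes(digest[:num_bytes], "big")


states = ['oklahoma', 'florida', 'idaho']
codes = {name: stable_hash(name) for name in states}
for name, code in codes.items():
    print(name, code)
```

Unlike the built-in `hash()`, this gives the same number for `'idaho'` in every process, so the encoding can be reused between training and prediction runs.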

Upvotes: 3

alko

Reputation: 48317

You can refer to Label preprocessing:

from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(['oklahoma', 'florida', 'idaho', 'pennsylvania', 'alabama',
     'washington'])
le.classes_
# array(['alabama', 'florida', 'idaho', 'oklahoma', 'pennsylvania',
#         'washington'],
#       dtype='|S12')
le.transform(["oklahoma"])
# array([3])
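To encode a whole feature vector rather than a single value, `fit_transform` and `inverse_transform` can be combined, roughly as follows. This is a sketch that assumes scikit-learn is installed; the sample list, including the repeated 'florida', is made up for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative column; real data would have one entry per sample.
features = ['oklahoma', 'florida', 'idaho', 'pennsylvania', 'alabama',
            'washington', 'florida']

le = LabelEncoder()
codes = le.fit_transform(features)      # learn the classes and encode in one step
restored = le.inverse_transform(codes)  # map the integers back to state names

print(codes)
print(list(restored))
```

The classes are sorted alphabetically, so each state gets the same integer no matter where it appears in the input.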

Upvotes: 6
