Reputation: 4571

How to encode categorical values in Python

Given a vocabulary ["NY", "LA", "GA"], how can one encode it in such a way that it becomes:

"NY" = 100
"LA" = 010
"GA" = 001

So if I do a lookup on "NY GA", I get 101

Upvotes: 3

Answers (5)

Open AI - Opting Out

Reputation: 24163

To create a lookup dictionary, reverse the vocabulary, enumerate it, and give each word 2 raised to the power of its index:

>>> vocabulary = ["NY", "LA", "GA"]
>>> d = dict((word, 2 ** i) for i, word in enumerate(reversed(vocabulary)))
>>> d
{'NY': 4, 'GA': 1, 'LA': 2}

To query the dictionary:

>>> query = "NY GA"
>>> sum(code for word, code in d.items() if word in query.split())
5

If you want it formatted to binary:

>>> '{0:b}'.format(5)
'101'

edit: if you want a 'one liner':

>>> '{0:b}'.format(
        sum(2 ** i
            for i, word in enumerate(reversed(vocabulary))
            if word in query.split()))
'101'

edit2: if you want padding, for example with six 'bits':

>>> '{0:06b}'.format(5)
'000101'
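
Putting the steps above together, a minimal Python 3 sketch could look like the following. The function names build_codes and encode are made up for illustration, and padding the result to the vocabulary length is an assumption about what the asker wants:

```python
def build_codes(vocabulary):
    """Map each word to a power of two; the first word gets the highest bit."""
    return {word: 2 ** i for i, word in enumerate(reversed(vocabulary))}


def encode(query, codes):
    """Return a zero-padded bit string for the space-separated words in query."""
    words = set(query.split())
    total = sum(code for word, code in codes.items() if word in words)
    # Pad to one bit per vocabulary entry, e.g. "03b" for three words.
    return format(total, "0{}b".format(len(codes)))


codes = build_codes(["NY", "LA", "GA"])
print(encode("NY GA", codes))  # 101
print(encode("GA", codes))     # 001
```

Words that are not in the vocabulary are silently ignored here, which may or may not be what you want.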

Upvotes: 1

Oliver W.

Reputation: 13459

Another solution using numpy. It looks like you're trying to binary-encode a vocabulary, so the code below feels natural to me.

import numpy as np

def to_binary_representation(your_str="NY LA"):
    xs = np.array(["NY", "LA", "GA"])
    ys = 2**np.arange(3)[::-1]
    lookup_table = dict(zip(xs,ys))

    return bin(np.sum([lookup_table[k] for k in your_str.split()]))

You don't need numpy for this, but it is probably faster if you have large arrays to work on. Without numpy, np.sum can be replaced by the builtin sum, and xs and ys can be turned into plain lists.
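
For reference, the numpy-free variant described above might look roughly like this. The vocabulary parameter is an addition for illustration (the original function hardcodes the three words):

```python
def to_binary_representation(your_str="NY LA", vocabulary=("NY", "LA", "GA")):
    # Weights 4, 2, 1 for three words: the first word gets the highest bit.
    weights = [2 ** i for i in reversed(range(len(vocabulary)))]
    lookup_table = dict(zip(vocabulary, weights))
    # A word missing from the vocabulary raises KeyError.
    return bin(sum(lookup_table[k] for k in your_str.split()))


print(to_binary_representation("NY LA"))  # 0b110
```

Note that bin() adds a "0b" prefix and drops leading zeros, so "GA" alone comes back as '0b1' rather than '001'.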

Upvotes: 1

ugursogukpinar

Reputation: 337

Or you can use powers of 10, so that the sum reads like the binary digits:

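    vocabulary = ["NY", "LA", "GA"]

    # Place value of the first word: 100 for a three-word vocabulary
    i = pow(10, len(vocabulary) - 1)
    dictVocab = dict()

    for word in vocabulary:
        dictVocab[word] = i
        i //= 10  # integer division keeps the codes as ints in Python 3

    yourStr = "NY LA"
    result = 0
    for word in yourStr.split():
        result += dictVocab[word]  # result ends up as 110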
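
One thing to watch with the decimal approach above: the sum is a plain integer, so leading zeros are lost and a repeated word would push a digit to 2. A small sketch that handles both, assuming a string result is acceptable (encode_decimal and place_values are made-up names):

```python
vocabulary = ["NY", "LA", "GA"]
# 100, 10, 1 for a three-word vocabulary
place_values = {word: 10 ** (len(vocabulary) - 1 - i)
                for i, word in enumerate(vocabulary)}


def encode_decimal(query):
    # set() drops repeated words so that no digit can exceed 1
    total = sum(place_values[word] for word in set(query.split()))
    # zfill restores the leading zeros that the integer sum loses
    return str(total).zfill(len(vocabulary))


print(encode_decimal("NY LA"))  # 110
print(encode_decimal("GA"))     # 001
print(encode_decimal("NY NY"))  # 100
```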

Upvotes: 1

Torxed

Reputation: 23500

vocab = ["NY", "LA", "GA"]
selectedVocabs = 'NY GA'
# Strings are immutable, so build the result as a list of characters.
categorybits = ['0'] * len(vocab)
for sel in selectedVocabs.split():
    categorybits[vocab.index(sel)] = '1'
categorystring = ''.join(categorybits)

This is the end result of my own testing. It turns out Python doesn't support string item assignment; somehow I thought it did.

Personally, I think behzad's solution is better: numpy does a better job and is faster.

Upvotes: 1

behzad.nouri

Reputation: 78011

You can use numpy.in1d (the NumPy docs recommend numpy.isin for new code; it gives the same result for 1-D input):

>>> import numpy as np
>>> xs = np.array(["NY", "LA", "GA"])
>>> ''.join('1' if f else '0' for f in np.in1d(xs, 'NY GA'.split(' ')))
'101'

or:

>>> ''.join(np.where(np.in1d(xs, 'NY GA'.split(' ')), '1', '0'))
'101'
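
For a small vocabulary, the same membership test can be sketched without NumPy. This is not part of the answer above, just a plain-Python equivalent of the in1d idea; encode_membership is a made-up name:

```python
def encode_membership(vocabulary, query):
    """Emit '1' for each vocabulary word present in query, '0' otherwise."""
    selected = set(query.split())  # set lookup keeps each test O(1)
    return ''.join('1' if word in selected else '0' for word in vocabulary)


print(encode_membership(["NY", "LA", "GA"], "NY GA"))  # 101
```

Because the output is built position by position in vocabulary order, it keeps its leading zeros without any extra padding step.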

Upvotes: 1
