DevEx

Reputation: 4561

How to encode categorical values in Python

Given a vocabulary ["NY", "LA", "GA"], how can one encode it in such a way that it becomes:

"NY" = 100
"LA" = 010
"GA" = 001

So if I do a lookup on "NY GA", I get 101

Upvotes: 3

Views: 198

Answers (5)

Open AI - Opting Out

Reputation: 24133

To create a lookup dictionary, reverse the vocabulary, enumerate it, and raise 2 to the power of each index:

>>> vocabulary = ["NY", "LA", "GA"]
>>> d = dict((word, 2 ** i) for i, word in enumerate(reversed(vocabulary)))
>>> d
{'GA': 1, 'LA': 2, 'NY': 4}

To query the dictionary (on Python 2, use d.iteritems() instead of d.items()):

>>> query = "NY GA"
>>> sum(code for word, code in d.items() if word in query.split())
5

If you want it formatted to binary:

>>> '{0:b}'.format(5)
'101'

Edit: if you want a one-liner:

>>> '{0:b}'.format(
...     sum(2 ** i
...         for i, word in enumerate(reversed(vocabulary))
...         if word in query.split()))
'101'

Edit 2: if you want zero-padding, for example to six bits:

>>> '{0:06b}'.format(5)
'000101'
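
The steps in this answer can be bundled into one helper. The following is only a sketch for Python 3: the function name encode is made up for illustration, and padding to the vocabulary length (rather than a fixed width) is an assumption about what the asker wants.

```python
def encode(query, vocabulary):
    """Return a bit string with one position per vocabulary entry."""
    # The first vocabulary entry gets the highest bit, as in the dictionary above.
    codes = {word: 1 << i for i, word in enumerate(reversed(vocabulary))}
    # Use a set so a word repeated in the query is only counted once.
    # A word missing from the vocabulary raises KeyError.
    total = sum(codes[word] for word in set(query.split()))
    # Pad to the vocabulary length so leading zeros are kept.
    return format(total, "0{}b".format(len(vocabulary)))


vocabulary = ["NY", "LA", "GA"]
print(encode("NY GA", vocabulary))  # 101
print(encode("GA", vocabulary))     # 001
```

Deduplicating the query with set() matters here because summing the same power of 2 twice would carry into the next bit.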

Upvotes: 1

Oliver W.

Reputation: 13459

Another solution, using numpy. It looks like you're trying to binary-encode a dictionary, so the code below feels natural to me.

import numpy as np

def to_binary_representation(your_str="NY LA"):
    xs = np.array(["NY", "LA", "GA"])
    ys = 2**np.arange(3)[::-1]
    lookup_table = dict(zip(xs,ys))

    return bin(np.sum([lookup_table[k] for k in your_str.split()]))

NumPy isn't required for this, but it is probably faster if you have large arrays to work on. Without it, replace np.sum with the built-in sum and turn xs and ys into plain lists.
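
For comparison, here is roughly what the non-numpy variant described above could look like. It is a sketch only; the name to_binary_representation_plain is invented to keep it apart from the numpy function.

```python
def to_binary_representation_plain(your_str="NY LA"):
    xs = ["NY", "LA", "GA"]
    # Powers of 2 in descending order: [4, 2, 1]
    ys = [2 ** i for i in reversed(range(len(xs)))]
    lookup_table = dict(zip(xs, ys))

    # The built-in sum replaces np.sum; bin() adds the '0b' prefix.
    return bin(sum(lookup_table[k] for k in your_str.split()))


print(to_binary_representation_plain("NY LA"))  # 0b110
```

Like the numpy version, this returns a string with a '0b' prefix and no leading zeros, so "GA" alone gives '0b1'.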

Upvotes: 1

ugursogukpinar

Reputation: 337

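Or you can use powers of 10, so that the decimal digits of the result read like the bit pattern:

    vocabulary = ["NY", "LA", "GA"]

    # Place value of the first word: 100 for a three-word vocabulary
    i = pow(10, len(vocabulary) - 1)
    dictVocab = dict()

    for word in vocabulary:
        dictVocab[word] = i
        i //= 10  # integer division keeps the values as ints on Python 3

    yourStr = "NY LA"
    result = 0
    for word in yourStr.split():
        result += dictVocab[word]
    # result is now 110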

Upvotes: 1

Torxed

Reputation: 23480

vocab = ["NY", "LA", "GA"]
categorystring = '0'*len(vocab)
selectedVocabs = 'NY GA'
for sel in selectedVocabs.split():
    categorystring = list(categorystring)
    categorystring[vocab.index(sel)] = '1'
    categorystring = ''.join(categorystring)

This is the end result of my own testing. It turns out Python doesn't support string item assignment (somehow I thought it did), which is why the string is converted to a list and back.
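
Since strings are immutable, the loop above rebuilds the string on every pass. A slightly tidier sketch of the same idea keeps a list of characters and joins it once at the end (the variable names here are just illustrative):

```python
vocab = ["NY", "LA", "GA"]
selected = "NY GA"

# Work on a mutable list of characters instead of a string.
bits = ["0"] * len(vocab)
for sel in selected.split():
    # vocab.index raises ValueError if the word is not in the vocabulary.
    bits[vocab.index(sel)] = "1"

category_string = "".join(bits)
print(category_string)  # 101
```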

Personally, I think behzad.nouri's solution is better: numpy does a better job and is faster.

Upvotes: 1

behzad.nouri

Reputation: 77941

You can use numpy.in1d:

>>> import numpy as np
>>> xs = np.array(["NY", "LA", "GA"])
>>> ''.join('1' if f else '0' for f in np.in1d(xs, 'NY GA'.split(' ')))
'101'

or:

>>> ''.join(np.where(np.in1d(xs, 'NY GA'.split(' ')), '1', '0'))
'101'
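
On newer NumPy versions, np.in1d is, as far as I know, deprecated in favour of np.isin, which takes the same two arguments here. A minimal sketch of the same lookup with np.isin, assuming NumPy is installed:

```python
import numpy as np

xs = np.array(["NY", "LA", "GA"])

# Boolean mask: True where the vocabulary entry appears in the query.
mask = np.isin(xs, "NY GA".split())

encoded = "".join("1" if f else "0" for f in mask)
print(encoded)  # 101
```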

Upvotes: 1
