Jose Ramon

Reputation: 5444

Threshold several variables into a binary categorical code in python

I have three variables in Python (age, gender, race) and I want to build a single binary categorical code from them. Age is an integer that I want to bin by decade (10-20, 20-30, 30-40, etc.), gender has 2 values, and race has 4 values. How can I produce one complete categorical code from the three variables?

Upvotes: 1

Views: 726

Answers (3)

Dinesh

Reputation: 1565

You can use an (n+1+4)-dimensional vector encoding, where n is the number of decades. Given that you want a binary code, this is one way of doing it.

The first n entries encode the decade: 1 if the age falls in that decade, 0 otherwise. The (n+1)th entry is 1 for male and 0 for female. The last 4 entries encode race the same way: 1 for the matching category, 0 otherwise.

Let's say your decades go up to 100, i.e. n=10 brackets (0-9, 10-19, ..., 90-99). For a 98-year-old white male, the code would be [0 0 0 0 0 0 0 0 0 1 1 1 0 0 0]: the tenth age entry is set, then gender is 1, then race is [1 0 0 0].

import numpy as np

def encodeAge(i, n):
    ageCode=np.zeros(n)
    ageCode[i]=1
    return ageCode

n=10 # number of decades
dict_race={'w':[1,0,0,0],'b':[0,1,0,0],'a':[0,0,1,0],'l':[0,0,0,1]} # white, black, asian, latino
dict_age={i:encodeAge(i, n) for i in range(n)}
dict_gender={'m':[1],'f':[0]}

def encodeAll(age, gender, race):
    # look up each part and join them in order: age decade, gender, race
    return np.concatenate([dict_age[age//10], dict_gender[gender], dict_race[race]])

For example, encodeAll(12,'m','w') returns array([0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0.])

This encoding is slightly longer than the others suggested here.
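The same n+1+4 layout can also be sketched without NumPy, using plain lists. This is only an illustrative variant, not a replacement for the code above: the names N_DECADES, RACES, GENDER_BIT, one_hot and encode_all_checked are hypothetical, and the 0-99 age range is an assumption carried over from n=10.

```python
N_DECADES = 10                 # assumed: ages 0-99, one slot per decade
RACES = ['w', 'b', 'a', 'l']   # same order as dict_race: white, black, asian, latino
GENDER_BIT = {'m': 1, 'f': 0}  # same convention as dict_gender


def one_hot(index, size):
    """Return a list of `size` zeros with a single 1 at position `index`."""
    if not 0 <= index < size:
        raise ValueError(f"index {index} is outside 0..{size - 1}")
    return [1 if i == index else 0 for i in range(size)]


def encode_all_checked(age, gender, race):
    """Hypothetical variant of encodeAll with an explicit error for bad ages."""
    return (one_hot(age // 10, N_DECADES)            # decade slot
            + [GENDER_BIT[gender]]                   # single gender entry
            + one_hot(RACES.index(race), len(RACES)))  # race slot


print(encode_all_checked(98, 'm', 'w'))
# [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0]
```

Wrapping the result in np.array(...) gives an array if one is needed downstream.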

Upvotes: 1

Aayush Mahajan

Reputation: 4033

Here is a method that returns a 7-bit code: the first 4 bits for the age bracket, the next 2 for race, and the last 1 for gender.

Using 4 bits for age limits you to 16 age brackets, which is reasonable since that covers ages 0-159.

The 4-bit age code is simply the 4-bit binary representation of the integer age//10, which discretizes the age into the ranges 0-9, 10-19, ..., 150-159.

The codes for race and gender are simply hard-coded in race_dict and gender_dict.
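To make the age step concrete, here is a small sketch of how age//10 maps to a 4-bit string. The helper name age_bits and the explicit range check are assumptions added for illustration; they are not part of the answer's function.

```python
def age_bits(age, width=4):
    """Hypothetical helper: binary string for the decade bracket age // 10."""
    bracket = age // 10
    # With 4 bits only brackets 0..15 (ages 0-159) fit in the fixed width.
    if not 0 <= bracket < 2 ** width:
        raise ValueError(f"age {age} does not fit in {width} bits")
    return format(bracket, f'0{width}b')


for age in (7, 25, 98, 159):
    print(age, age_bits(age))
# 7 0000
# 25 0010
# 98 1001
# 159 1111
```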

def get_code(age, race, gender): #returns fixed size 7 bit code
    race_dict = {'African':'00','Hispanic':'01','European':'10','Cantonese':'11'} 
    gender_dict = {'Male':'0','Female':'1'}

    age_code = '{0:b}'.format(age//10).zfill(4)
    race_code = race_dict[race]
    gender_code = gender_dict[gender]

    return  age_code + race_code + gender_code

Input: age:25, race: 'Hispanic', gender: 'Female'

7-bit code: 0010011

If you would like this code as an integer in the range 0-127 for numerical purposes, you can use int(code_str, 2).

EDIT:

To get a NumPy array from the code string, use np_code_arr = np.fromstring(' '.join(list(code_str)), dtype = int, sep = ' ')
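As a quick illustration of the integer form, the sketch below converts the example code to an integer and splits it back into its three fields with shifts and masks. The variable names are made up for this example.

```python
code_str = '0010011'            # the 7-bit code from the example above
code_int = int(code_str, 2)     # parse the string as a base-2 number
print(code_int)                 # 19

# Recover the three fields from the integer.
age_bracket = code_int >> 3          # top 4 bits
race_bits = (code_int >> 1) & 0b11   # middle 2 bits
gender_bit = code_int & 0b1          # lowest bit
print(age_bracket, race_bits, gender_bit)  # 2 1 1
```

Here 2 is the 20-29 bracket, 1 ('01') is 'Hispanic' and 1 is 'Female', matching the input.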

Upvotes: 2

gustavovelascoh

Reputation: 1228

My answer here:

With age a, gender g and race r given as integers (g in 0-1, r in 0-3):

code = np.array([int(i) for i in "{0:04b}{1:01b}{2:02b}".format(a//10,g,r)])

For age=58, gender=1 and race=3, the output is:

array([0, 1, 0, 1, 1, 1, 1])
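One caveat with the format-string approach: the width specifiers pad short values but do not truncate long ones, so an out-of-range input (for example r=4) would produce more than 7 entries. A minimal sketch of a guarded version follows; the wrapper name encode_bits and the accepted ranges are assumptions, and it returns a plain list that can be passed to np.array if needed.

```python
def encode_bits(a, g, r):
    """Hypothetical wrapper: 7-element list of 0/1 for age a, gender g, race r."""
    # Reject values that would not fit in the 4 + 1 + 2 bit layout.
    if not (0 <= a // 10 < 16 and g in (0, 1) and r in (0, 1, 2, 3)):
        raise ValueError("expected 0 <= a < 160, g in {0, 1}, r in {0, 1, 2, 3}")
    return [int(ch) for ch in "{0:04b}{1:01b}{2:02b}".format(a // 10, g, r)]


print(encode_bits(58, 1, 3))  # [0, 1, 0, 1, 1, 1, 1]
```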

Upvotes: 1
