Reputation: 5444
I have three variables in Python (age, gender, race) and I want to create a unique categorical binary code from them. Age is an integer that I want to bin by decade (10-20, 20-30, 30-40, etc.), gender has 2 values, and race has 4 values. How can I return a complete categorical code from the three initial variables?
Upvotes: 1
Views: 726
Reputation: 1565
You can use an (n+1+4)-dimensional vector encoding. Given the binary code you require, this would be one way of doing it.
The first n entries encode the decade: 1 if the age belongs to that decade, 0 otherwise. The next, (n+1)th, entry could be 1 if male and 0 if female. Race is encoded similarly over the last 4 entries: 1 if it belongs to that category, 0 otherwise.
Let's say you have n=10 decades covering ages 0 to 99. For a 98-year-old white male, the code would be [0 0 0 0 0 0 0 0 0 1 1 1 0 0 0]: the age bit sits at index 98//10 = 9, followed by 1 for male and [1 0 0 0] for white.
import numpy as np

def encodeAge(i, n):
    # one-hot vector of length n with a 1 at decade index i
    ageCode = np.zeros(n)
    ageCode[i] = 1
    return ageCode

n = 10  # number of decades
dict_race = {'w': [1,0,0,0], 'b': [0,1,0,0], 'a': [0,0,1,0], 'l': [0,0,0,1]}  # white, black, asian, latino
dict_age = {i: encodeAge(i, n) for i in range(n)}
dict_gender = {'m': [1], 'f': [0]}

def encodeAll(age, gender, race):
    # encode age
    code = []
    code = np.concatenate([code, dict_age[age//10]])
    # encode gender
    code = np.concatenate([code, dict_gender[gender]])
    # encode race
    code = np.concatenate([code, dict_race[race]])
    return code
E.g. encodeAll(12,'m','w') would return array([0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0.])
This is a slightly longer encoding than the others suggested.
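Note that the dictionary lookup above raises a KeyError for ages of 100 or more, since dict_age only has keys 0 through n-1. A minimal sketch of one way to guard against that, assuming it is acceptable to fold such ages into the last bracket (the helper name encode_all_clamped and the clamping policy are assumptions, not part of the answer above):

```python
import numpy as np

N_DECADES = 10                 # brackets 0-9, 10-19, ..., 90-99
RACES = ['w', 'b', 'a', 'l']   # white, black, asian, latino
GENDERS = {'m': 1, 'f': 0}

def encode_all_clamped(age, gender, race, n=N_DECADES):
    """One-hot age decade + 1 gender bit + one-hot race, as a float array."""
    if age < 0:
        raise ValueError("age must be non-negative")
    # Assumption: ages past the last bracket are folded into it.
    decade = min(age // 10, n - 1)
    age_code = np.zeros(n)
    age_code[decade] = 1
    race_code = np.zeros(len(RACES))
    race_code[RACES.index(race)] = 1
    return np.concatenate([age_code, [GENDERS[gender]], race_code])

print(encode_all_clamped(12, 'm', 'w'))
print(encode_all_clamped(104, 'f', 'l'))
```

The vector layout is the same as before (n age entries, 1 gender entry, 4 race entries), so the result for in-range ages matches encodeAll.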
Upvotes: 1
Reputation: 4033
Here is a method returning a 7-bit code, with the first 4 bits for the age bracket, the next 2 for race, and the last 1 for gender.
Using 4 bits for age imposes the constraint that there can be at most 16 age brackets, which is reasonable as it covers the age range 0-159.
The 4-bit age code is simply the 4-bit binary representation of the integer age//10, which discretizes the age into the ranges 0-9, 10-19, ..., 150-159.
The codes for race and gender are simply hard-coded in race_dict and gender_dict.
def get_code(age, race, gender):  # returns fixed size 7 bit code
    race_dict = {'African':'00','Hispanic':'01','European':'10','Cantonese':'11'}
    gender_dict = {'Male':'0','Female':'1'}
    age_code = '{0:b}'.format(age//10).zfill(4)
    race_code = race_dict[race]
    gender_code = gender_dict[gender]
    return age_code + race_code + gender_code
Input: age:25, race: 'Hispanic', gender: 'Female'
7-bit code: 0010011
If you would like this code to be an integer value between 0 and 127 for numerical purposes, you can use int(code_str, 2) to achieve that.
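As a quick check of the integer conversion, the sketch below turns the example code into an integer and back again. The variable names are only illustrative; format(value, '07b') zero-pads the binary representation to 7 digits.

```python
code_str = '0010011'             # age 25, 'Hispanic', 'Female'

value = int(code_str, 2)         # parse the bit string as a base-2 integer
print(value)                     # 19

restored = format(value, '07b')  # back to a zero-padded 7-character bit string
print(restored)                  # 0010011

# Slice the fixed-width fields back out: 4 age bits, 2 race bits, 1 gender bit.
age_bracket = int(restored[:4], 2)
race_bits, gender_bit = restored[4:6], restored[6]
print(age_bracket, race_bits, gender_bit)  # 2 01 1
```

Because every field has a fixed width, the integer form loses no information and can be decoded by slicing.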
EDIT: To get a NumPy array from the code string, use np_code_arr = np.fromstring(' '.join(list(code_str)), dtype = int, sep = ' ')
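An alternative that avoids np.fromstring is to convert each character directly. This is only a sketch; code_str here is assumed to be the 7-character string from the example above.

```python
import numpy as np

code_str = '0010011'

# Convert each '0'/'1' character to an int, then wrap the list in an array.
np_code_arr = np.array([int(bit) for bit in code_str])
print(np_code_arr)  # [0 0 1 0 0 1 1]
```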
Upvotes: 2
Reputation: 1228
My answer here:
With age a, gender g and race r all given as integers (g in 0-1, r in 0-3),
code = np.array([int(i) for i in "{0:04b}{1:01b}{2:02b}".format(a//10,g,r)])
For age=58, gender=1 and race=3, the output will be:
array([0, 1, 0, 1, 1, 1, 1])
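The one-liner assumes gender and race are already integers. A minimal sketch of wrapping it in a function with label lookups and a range check follows; the function name encode, the label lists, and their ordering are all assumptions chosen for illustration.

```python
import numpy as np

GENDERS = ['female', 'male']                       # hypothetical labels -> 0, 1
RACES = ['race_0', 'race_1', 'race_2', 'race_3']   # hypothetical labels -> 0..3

def encode(age, gender, race):
    """Return a 7-element 0/1 array: 4 age bits, 1 gender bit, 2 race bits."""
    a = age // 10
    if not 0 <= a < 16:
        raise ValueError("age must be in 0-159 to fit in 4 bits")
    g = GENDERS.index(gender)   # raises ValueError for an unknown label
    r = RACES.index(race)
    bits = "{0:04b}{1:01b}{2:02b}".format(a, g, r)
    return np.array([int(i) for i in bits])

print(encode(58, 'male', 'race_3'))  # [0 1 0 1 1 1 1]
```

The range check matters because the format widths are minimums: an age of 160 or more would silently produce more than 4 age bits and break the fixed 7-bit layout.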
Upvotes: 1