user184074

Reputation: 93

How can I one-hot encode data with numpy?

Suppose I have a dataset

sex    age hours
female 23  900
male   19  304
female 42  222
      ...

If I use np.loadtxt or np.genfromtxt, I can use a converter to assign a value to each category in the sex column. Is there a way to create one-hot columns during the loading process instead? If not, where should I look to accomplish this?

Upvotes: 1

Views: 2745

Answers (3)

Andy Hayden

Reputation: 375377

With pandas, you can pass the category dtype when reading the file (which loads the column cheaply):

In [11]: df = pd.read_csv("my_file.csv", dtype={"sex": "category"})

In [12]: df
Out[12]:
      sex  age  hours
0  female   23    900
1    male   19    304
2  female   42    222

In [13]: df.dtypes
Out[13]:
sex      category
age         int64
hours       int64
dtype: object

Once you have a category you can use get_dummies:

In [21]: pd.get_dummies(df.sex)
Out[21]:
   female  male
0       1     0
1       0     1
2       1     0

In [22]: pd.get_dummies(df.sex.cat.codes)
Out[22]:
   0  1
0  1  0
1  0  1
2  1  0
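As a small extension of the above, here is a sketch that encodes the column while keeping the other columns in the same frame. The in-memory CSV text is an assumption standing in for my_file.csv, and dtype=int is passed only so the result shows 0/1 (recent pandas versions return booleans by default):

```python
import io
import pandas as pd

# Sample data mirroring the question; the in-memory buffer stands in for a real file.
csv_text = "sex,age,hours\nfemale,23,900\nmale,19,304\nfemale,42,222\n"

df = pd.read_csv(io.StringIO(csv_text), dtype={"sex": "category"})

# Encode only the "sex" column; age and hours pass through unchanged.
encoded = pd.get_dummies(df, columns=["sex"], dtype=int)
print(encoded)
```

The new columns are named with the original column name as a prefix (sex_female, sex_male), and encoded.to_numpy() gives a plain NumPy array if that is what you need downstream.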

Upvotes: 3

Paul Panzer

Reputation: 53029

Here is a genfromtxt approach:

import numpy as np

def hot(s):
    rec = np.genfromtxt(s, dtype="i8,i4,i4", skip_header=1,
                        converters={0:{b'male':1<<32, b'female':1}.__getitem__})
    return rec.view(np.int32).reshape((-1, 4))

print(hot("<your_file_name>"))  # replace the placeholder with the path to your file

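Explanation: I think converters are required to return a single value. To get two, we give the first column a double-width dtype (i8) and then view the resulting structured array as int32, which splits that column into two 32-bit halves. Note that this relies on a little-endian byte layout.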
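If encoding after loading is acceptable, a different and more conventional route is to load the file first and then build the one-hot columns with np.unique and np.eye. This is a minimal sketch, not the converter trick above; the in-memory text is an assumption standing in for the whitespace-separated file from the question:

```python
import io
import numpy as np

# Sample data mirroring the question; the in-memory buffer stands in for a real file.
text = "sex age hours\nfemale 23 900\nmale 19 304\nfemale 42 222\n"

# dtype=None lets genfromtxt infer a type per column; names=True reads the header row.
rec = np.genfromtxt(io.StringIO(text), dtype=None, names=True, encoding="utf-8")

# np.unique returns the sorted categories and, for each row, its category index.
categories, codes = np.unique(rec["sex"], return_inverse=True)

# Indexing the identity matrix by code yields one one-hot row per record.
one_hot = np.eye(len(categories), dtype=np.int64)[codes]

# Put the one-hot columns in front of the numeric columns.
result = np.column_stack([one_hot, rec["age"], rec["hours"]])
print(categories)
print(result)
```

This does not depend on byte order and handles any number of categories, at the cost of a second pass over the categorical column.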

Upvotes: 0

dgumo

Reputation: 1878

Have a look at the pandas.get_dummies function.

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html

Upvotes: 0
