Reputation: 93
Suppose I have a dataset
sex age hours
female 23 900
male 19 304
female 42 222
...
If I use np.loadtxt or np.genfromtxt, I can pass a converter to map each category in the sex column to a value. Is there a way to create one-hot columns during the loading process instead? If not, where should I look to accomplish this?
Upvotes: 1
Views: 2745
Reputation: 375377
With pandas, you can pass the category dtype (which loads in the column cheaply):
In [11]: df = pd.read_csv("my_file.csv", dtype={"sex": "category"})
In [12]: df
Out[12]:
      sex  age  hours
0  female   23    900
1    male   19    304
2  female   42    222
In [13]: df.dtypes
Out[13]:
sex      category
age         int64
hours       int64
dtype: object
Once you have a category column, you can use get_dummies:
In [21]: pd.get_dummies(df.sex)
Out[21]:
   female  male
0       1     0
1       0     1
2       1     0
In [22]: pd.get_dummies(df.sex.cat.codes)
Out[22]:
   0  1
0  1  0
1  0  1
2  1  0
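To keep the one-hot columns alongside the rest of the data in one step, get_dummies also accepts a whole DataFrame together with a columns argument. A minimal sketch, assuming a comma-separated file with the three columns from the question; the in-memory csv_text is a hypothetical stand-in for my_file.csv:

```python
import io

import pandas as pd

# Hypothetical stand-in for my_file.csv (comma-separated is an assumption).
csv_text = "sex,age,hours\nfemale,23,900\nmale,19,304\nfemale,42,222\n"

# Load the categorical column with the category dtype, as above.
df = pd.read_csv(io.StringIO(csv_text), dtype={"sex": "category"})

# Encode only the "sex" column; the other columns pass through unchanged.
# dtype=int requests 0/1 integers instead of the pandas default dummy dtype.
encoded = pd.get_dummies(df, columns=["sex"], dtype=int)
print(encoded)
```

The new columns are named after the original column and each category (sex_female, sex_male), and the original sex column is dropped.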
Upvotes: 3
Reputation: 53029
Here is a genfromtxt approach:
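import numpy as np

def hot(s):
    # Read sex as one 64-bit int: 'female' sets a bit in the low 32-bit half, 'male' in the high half.
    rec = np.genfromtxt(s, dtype="i8,i4,i4", skip_header=1,
                        converters={0: {b'male': 1 << 32, b'female': 1}.__getitem__})
    # Reinterpret each 16-byte record as four int32 values.
    return rec.view(np.int32).reshape((-1, 4))

print(hot(<your_file_name>))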
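Explanation: I think converters are required to return a single value. To get two values, we give the first column a double-width dtype (i8) and then view-cast the resulting record array to int32, which splits that column into two 32-bit columns. On a little-endian machine the resulting column order is female, male, age, hours.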
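The view cast can be seen in isolation on a small array. This sketch is only an illustration of the idea and not part of the loading code; the codes array is made up for the example, and the byte order is pinned to little-endian so the column order is predictable:

```python
import numpy as np

# Each category is stored as one 64-bit integer with a single bit set
# in either the low or the high 32-bit half.
codes = np.array([1, 1 << 32, 1], dtype="<i8")  # female, male, female

# Reinterpreting the same bytes as little-endian int32 splits every
# 64-bit value into two 32-bit values: (low half, high half).
one_hot = codes.view("<i4").reshape(-1, 2)
print(one_hot)
```

No data is copied here; view only changes how the existing bytes are interpreted.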
Upvotes: 0
Reputation: 1878
Have a look at the pandas.get_dummies function.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html
Upvotes: 0