Reputation: 31
I am trying to one-hot encode a data table with multiple columns against a given, fixed set of categories:
from sklearn.preprocessing import OneHotEncoder

ohe1 = OneHotEncoder(categories=[list_names_data_rest.values], dtype='int8')
data_rest1 = ohe1.fit_transform(data_rest.values).toarray()
Here, list_names_data_rest.values is an array of shape (664,). I have 664 unique features, and I am trying to encode data_rest, which has shape (5050, 6). After encoding, I am expecting a shape of (5050, 664).
I am one-hot encoding against a pre-defined feature set because I am downloading the data set in chunks (due to RAM limitations), and I would like the input shape to my neural network to be consistent.
If I use pd.get_dummies, then depending on the data set, I could get different categories and a different input shape for my NN.
ohe1.fit_transform requires an input of shape (n_samples, n_features), but I do not know how to handle this.
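For reference, a minimal sketch of how the fixed-category setup could look with scikit-learn's OneHotEncoder, assuming the same 664-entry category list applies to each of the 6 columns (variable names taken from the question). Note that the encoder expects one category list per input column, so the output would be (5050, 6 * 664), not (5050, 664):

from sklearn.preprocessing import OneHotEncoder

# One category list per input column; repeat the fixed list for all 6 columns
cats = list_names_data_rest.values
ohe1 = OneHotEncoder(categories=[cats] * 6,
                     handle_unknown='ignore',  # values outside the fixed list encode as all zeros
                     dtype='int8')

# One 664-wide block per column: shape (5050, 6 * 664)
data_rest1 = ohe1.fit_transform(data_rest.values).toarray()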
Upvotes: 0
Views: 1024
Reputation: 1614
If you wish to use pd.get_dummies, one option is to build up your encoding iteratively, batch by batch.
For your first batch:
import pandas as pd

ohe = pd.get_dummies(data_rest, columns=['label_col'])
For every subsequent batch:
for b in batches:
    batch_ohe = pd.get_dummies(b, columns=['label_col'])
    # Stack the new batch under the running frame; columns missing from
    # either part become NaN and are filled with 0
    ohe = pd.concat([ohe, batch_ohe], axis=0)
    ohe = ohe.fillna(0)
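If the goal is a consistent width for every batch up front (so each batch can be fed to the network immediately, without accumulating one growing frame), an alternative sketch is to reindex each batch's dummies against the full expected column set; known_categories here is a hypothetical list of the 664 pre-defined category values:

import pandas as pd

# Hypothetical: the full, pre-defined list of category values
expected_cols = ['label_col_' + str(c) for c in known_categories]

for b in batches:
    batch_ohe = pd.get_dummies(b, columns=['label_col'])
    # Add missing dummy columns as zeros and drop unexpected ones,
    # so every batch has an identical column layout
    batch_ohe = batch_ohe.reindex(columns=expected_cols, fill_value=0)
    # ...feed batch_ohe to the network here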
Upvotes: 0
Reputation: 36
HashingVectorizer may be a good solution for your case. It is independent of the number of input features; just set the initial size large enough.
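A minimal sketch of that idea, assuming data_rest is a DataFrame of categorical columns that can be cast to strings (n_features=1024 is an arbitrary choice; pick a size comfortably above the expected number of categories):

from sklearn.feature_extraction.text import HashingVectorizer

# Fixed-width output regardless of which categories appear,
# so every chunk yields the same input shape for the network
hv = HashingVectorizer(n_features=1024, alternate_sign=False, norm=None)

# Treat each row's categorical values as one space-separated "document"
docs = data_rest.astype(str).agg(' '.join, axis=1)
X = hv.transform(docs).toarray()  # shape: (n_rows, 1024) for every chunk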
Upvotes: 0