Varoon

Reputation: 31

OneHotEncoder Multiple Columns

I am trying to one-hot encode a data table with multiple columns against a given set of categories:

ohe1 = OneHotEncoder(categories = [list_names_data_rest.values],dtype = 'int8')
data_rest1 = ohe1.fit_transform(data_rest.values).toarray()

Here, list_names_data_rest.values is an array of shape (664,). I have 664 unique features and I am trying to encode data_rest, which has shape (5050, 6). After encoding, I expect a shape of (5050, 664).

I am one-hot encoding to a predefined feature set because I am downloading the data in chunks (due to RAM limitations) and I would like the input shape to my neural network to be consistent.

If I use pd.get_dummies, then depending on the chunk I could get different categories and therefore a different input shape for my NN.

ohe1.fit_transform expects input of shape (n_samples, n_features), but I do not know how to handle this.

Upvotes: 0

Views: 1024

Answers (2)

panktijk

Reputation: 1614

If you wish to use pd.get_dummies, you can build up the encoding iteratively, one batch at a time.

For your first batch:

ohe = pd.get_dummies(data_rest, columns=['label_col'])

For every subsequent batch:

for b in batches:
    batch_ohe = pd.get_dummies(b, columns=['label_col'])
    ohe = pd.concat([ohe, batch_ohe], axis=0)

ohe = ohe.fillna(0)
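
If the full category list is known in advance (as with the 664 names in the question), a possible variation is to align every batch to that list instead of concatenating the batches. This is only a sketch under assumptions: `all_categories`, `encode_batch` and the two sample batches are hypothetical names and data, and `label_col` is the same placeholder column name used above.

```python
import pandas as pd

# Hypothetical fixed category list, known before any batch is loaded.
all_categories = ["a", "b", "c", "d"]
# pd.get_dummies names its columns "<column>_<value>" by default.
expected_cols = [f"label_col_{c}" for c in all_categories]


def encode_batch(batch):
    """One-hot encode one batch and align it to the fixed column set."""
    dummies = pd.get_dummies(batch, columns=["label_col"], dtype="int8")
    # Categories missing from this batch become all-zero columns,
    # and the column order is the same for every batch.
    return dummies.reindex(columns=expected_cols, fill_value=0)


# Two small made-up batches that each see only part of the categories.
batch1 = pd.DataFrame({"label_col": ["a", "c"]})
batch2 = pd.DataFrame({"label_col": ["d", "d", "b"]})

enc1 = encode_batch(batch1)
enc2 = encode_batch(batch2)
```

With this approach each batch has the same width regardless of which categories it happens to contain, so no batch has to stay in memory after it is encoded.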

Upvotes: 0

Burak yazicı

Reputation: 36

HashingVectorizer may be a good solution for your case. Its output size does not depend on the number of input features; just set the initial size large enough.
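
A rough sketch of how this might look for categorical rows, assuming scikit-learn is installed. HashingVectorizer is designed for text, so this passes a callable `analyzer` that turns each row into tokens. The helper `row_tokens`, the width `N_FEATURES` and the sample chunks are made up for illustration.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Assumed output width; larger values make hash collisions less likely.
N_FEATURES = 2 ** 12


def row_tokens(row):
    """Turn one row of category values into tokens.

    The column index is included so that the same value appearing in
    two different columns is hashed to (most likely) different slots.
    """
    return [f"col{i}={value}" for i, value in enumerate(row)]


hasher = HashingVectorizer(
    n_features=N_FEATURES,
    analyzer=row_tokens,   # use the rows as-is, no text tokenization
    alternate_sign=False,  # keep counts non-negative
    norm=None,             # no row normalization, so values stay 0/1 counts
)

# Two made-up chunks with different categories present in each.
chunk_a = [["red", "small"], ["blue", "large"]]
chunk_b = [["green", "small"], ["red", "medium"], ["blue", "small"]]

# The hasher is stateless, so transform() can be called per chunk
# without fitting, and the output width is always N_FEATURES.
X_a = hasher.transform(chunk_a)
X_b = hasher.transform(chunk_b)
```

The trade-off is that hashed columns cannot be mapped back to category names, and distinct categories can collide in the same column if the width is too small.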

Upvotes: 0
