Reputation: 142
I m using this dataset of crop agriculture. In order to use it for creating a neural network, I preprocessed the data using MinMaxScalar, this would scale the data between 0 and 1. But my dataset also consist of categorical columns, because of which I got an error during preprocessing. So I tried encoding the categorical columns using OneHotEncoder and LabelEncoder but I don't understand what to do with it then.
My aim is to predict "Crop_Damage".
How do I proceed ?
Link to the dataset - https://www.kaggle.com/aniketng21600/crop-damage-information-in-india
Upvotes: 0
Views: 756
Reputation: 8112
here is a one hot encoder. df is the data frame you are working with, column is the name of the column you want to encode. prefix is a string that gets appended to the column names created by pandas dummies. What happens is the new dummy columns are created and appended to the data frame as new columns. The original column is then deleted. There is an excellent series of videos on encoding data frames and other topics on Youtube here.
def onehot_encode(df, column, prefix):
df = df.copy()
dummies = pd.get_dummies(df[column], prefix=prefix)
df = pd.concat([df, dummies], axis=1)
df = df.drop(column, axis=1)
return df
Upvotes: 0
Reputation: 883
You have several options.
You may use one hot encoding and pass your categorical variable to network as one-hot network.
You may get inspiration from NLP and their processing. One hot vectors are sparse and may be really huge(depends on unique values of your categorical variable). Please look at techniques Word2vec(cat2vec) or GloVe. Both of them aims to create from categorical element, nonsparse numeric vector(meaningful).
Beside of these two keras offer another way how to obtain this numeric vector. Its called embeded layer. For example, lets consider that you have variable Crop damage with these values:
First you assign unique integer for every unique value of your categorical variable.
Than you pass translated categorical values(unique integers) to emebeded layer. Embeded layer takes at input sequence of unique integers and produce sequence of dense vectors. Values of these vectors are firstly random, but during training are optimized like regular weights of neural network. So we can say that during the training neural network build vector representation of categories according to loss function.
For me is embeded layer the easiest way to obtain good enough vector representation of categorical variables. But you can try first with one hot if accuracy satisfy you.
Upvotes: 2