abautista

Reputation: 2790

How to process multiple categorical columns in an artificial neural network?

Currently, I am working on the following dataset, where I have multiple columns with different values, and I want to classify each row into the correct category, in this case the correct engineer.

Goal: Based on the category, problem category, affected devices, reason for creating, issue status and priority, determine which engineer each ticket belongs to. This is a classification problem, and I am using an artificial neural network to solve it.

Structure of the dataset

Category      | Problem Category   | Affected devices      | Reason for creating | Issue status | Priority | Security Engineer

Cybersecurity | Penetration breach | Personal user devices | Hourly analysis     | Transferred  | 3        | K. Schulz
Cybersecurity | Lack of Cert       | Company main devices  | Hourly analysis     | Closed       | 2        | U. Frank
IoT           | Malware installed  | Personal user devices | Hourly analysis     | Transferred  | 2        | L. Tolso
....
....


# Requires scikit-learn < 0.22: the categorical_features parameter of
# OneHotEncoder used below was deprecated in 0.20 and removed in 0.22.
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Matrix of features
X = dataset.iloc[:, :-1].values

# Dependent variable: engineers (use -1, not -1:, so y is a 1-D vector)
y = dataset.iloc[:, -1].values

# Encode the categorical data to numerical data
# The Priority column will not be encoded because it is already numerical, i.e., 0, 1, 2, 3.

labelEncoder_X_category            = LabelEncoder()
labelEncoder_X_problem_category    = LabelEncoder()
labelEncoder_X_affected_devices    = LabelEncoder()
labelEncoder_X_reason_for_creating = LabelEncoder()
labelEncoder_X_issue_status        = LabelEncoder()

X[:, 0] = labelEncoder_X_category.fit_transform(X[:, 0])
X[:, 1] = labelEncoder_X_problem_category.fit_transform(X[:, 1])
X[:, 2] = labelEncoder_X_affected_devices.fit_transform(X[:, 2])
X[:, 3] = labelEncoder_X_reason_for_creating.fit_transform(X[:, 3])
X[:, 4] = labelEncoder_X_issue_status.fit_transform(X[:, 4])

# Create dummy variables
# Column zero (Category) will be split into multiple columns of 0s and 1s
oneHotEncoder_category = OneHotEncoder(categorical_features = [0])

# The transform replaces column zero with the new 0/1 columns and keeps the rest of the matrix
X = oneHotEncoder_category.fit_transform(X).toarray()

print(X)

# Split the data into training and test set
# Not yet implemented because I want to solve my questions

# Feature scaling       
# Not yet implemented because I want to solve my questions
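For reference, on newer scikit-learn versions (0.22+), where `categorical_features` no longer exists, the same preprocessing could be sketched with `ColumnTransformer`; the toy rows below are made up to mirror the table above:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy rows mirroring the table above (values are made up)
dataset = pd.DataFrame({
    "Category": ["Cybersecurity", "Cybersecurity", "IoT"],
    "Issue status": ["Transferred", "Closed", "Transferred"],
    "Priority": [3, 2, 2],
    "Security Engineer": ["K. Schulz", "U. Frank", "L. Tolso"],
})

# One-hot encode the categorical columns; pass Priority through unchanged
ct = ColumnTransformer(
    [("onehot", OneHotEncoder(), ["Category", "Issue status"])],
    remainder="passthrough",
)

X_encoded = ct.fit_transform(dataset.drop(columns=["Security Engineer"]))
print(X_encoded.shape)  # (3, 5): 2 Category dummies + 2 status dummies + Priority
```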

Questions

  1. All the columns have been encoded into numerical values except for the Priority column, and only column zero (Category) was split into several columns of 0s and 1s. Do I also need to split the other columns into 0s and 1s, or is one column enough?

  2. I am concerned that I need to avoid the multicollinearity problem (the dummy variable trap), that is, I cannot include all the dummy variables in my model. How can I apply this same principle if I encode the rest of the columns into 0s and 1s?
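As a sketch of what I mean by dropping a dummy variable per column (the values below are made up):

```python
import pandas as pd

# Toy column mirroring "Issue status" (values are made up)
status = pd.DataFrame({"Issue status": ["Transferred", "Closed", "Open", "Closed"]})

# drop_first=True removes one dummy per column, avoiding perfect multicollinearity
dummies = pd.get_dummies(status, drop_first=True)
print(list(dummies.columns))  # ['Issue status_Open', 'Issue status_Transferred']
```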

I have tried to describe my situation as thoroughly as possible, and I hope I haven't confused anyone. If I have, feel free to correct me or ask more questions; I would be more than happy to clarify.

Upvotes: 0

Views: 1240

Answers (1)

nuric

Reputation: 11225

There are different ways to encode categorical data, but the most common one is one-hot encoding. In your case:

  1. Yes, you will have to one-hot encode all categorical columns so that every column becomes a vector [0,0,...,1,0,0,...]. You can then concatenate all the column vectors into a single large one as the input to the network. The output will be a classification over the engineers. You might also want to one-hot encode Priority, since it probably takes finitely many discrete values that can be treated as categories.

  2. I'm not sure why you are concerned about multicollinearity. That is often a concern when doing regression; in your case, for classification, the neural network will basically select combinations of states (because you one-hot encode every column) and learn to ignore others. This is true for any learning algorithm: if engineer A always responds to a certain category, that alone will be enough to classify the data.
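A minimal sketch of point 1, with made-up cardinalities: each column becomes a one-hot vector, and the vectors are concatenated into a single network input:

```python
import numpy as np

def one_hot(index, size):
    """Return a one-hot vector of the given size with a 1 at index."""
    v = np.zeros(size)
    v[index] = 1.0
    return v

# Assumed cardinalities: 2 categories, 3 problem types, 4 priority levels
row = np.concatenate([one_hot(0, 2), one_hot(2, 3), one_hot(3, 4)])
print(row)        # [1. 0. 0. 0. 0. 1. 0. 0. 0. 1.] minus the extra zero: see shape
print(row.shape)  # (9,) -- the network's input dimension
```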

Looking at your data, I would consider using a decision tree. Leaf nodes would be engineers, and you branch on the most distinguishing feature. The advantage is that you know exactly what it has learnt and can visualise it. Even better, if there is a slight change (new category, new engineer, etc.) you can modify the tree manually until you have new training data.
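This suggestion can be sketched with scikit-learn's DecisionTreeClassifier on already-encoded features; the tiny dataset below is made up:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Tiny made-up dataset: two features and the engineer label
df = pd.DataFrame({
    "Category": ["Cybersecurity", "Cybersecurity", "IoT", "IoT"],
    "Priority": [3, 2, 2, 1],
    "Security Engineer": ["K. Schulz", "U. Frank", "L. Tolso", "L. Tolso"],
})

# One-hot encode Category; Priority stays numeric
X = pd.get_dummies(df[["Category", "Priority"]], columns=["Category"])
y = df["Security Engineer"]

tree = DecisionTreeClassifier().fit(X, y)

# export_text prints the learnt branching rules in human-readable form
print(export_text(tree, feature_names=list(X.columns)))
```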

Upvotes: 1
