user2578525

Reputation: 191

Handling Dependent features in machine learning

I have a dataset like

 Project | Area       | Feature 1 | Feature 2 |
---------+------------+-----------+-----------+...
 A       | Production |     X     |     X     |
 A       | Testing    |     Y     |     Y     |
 B       | Testing    |     Z     |     Z     |
 C       | QA         |     W     |     W     |

Here "Area" is dependent on "Project" (i.e. the combination of Area and Project makes up the identity of an Area), and the two have a many-to-many relationship. I'm predicting Area with a deep neural network in Keras. How should I preprocess this data?

Project is a very important feature.

Also, is there any formula for approximating the number of training examples required for X features?

Upvotes: 0

Views: 2355

Answers (1)

Sorin

Reputation: 11968

Having related features is not in itself a problem. The problems usually show up when the input features you have at prediction time are not the same as the ones you had during training.
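One way to keep the inputs identical in both phases is to build the encoding once from the training data and reuse it unchanged for predictions. Here is a minimal sketch of that idea in plain Python, assuming "Project" is one-hot encoded; `fit_vocabulary` and `one_hot` are hypothetical helper names, not part of Keras, and the project "D" is made up to show an unseen value:

```python
# Hypothetical helpers for illustration; not part of Keras or any library.

def fit_vocabulary(values):
    """Map each distinct category seen in training to a fixed column index."""
    return {value: index for index, value in enumerate(sorted(set(values)))}


def one_hot(value, vocabulary):
    """Encode one category; a category unseen in training becomes all zeros."""
    vector = [0.0] * len(vocabulary)
    index = vocabulary.get(value)
    if index is not None:
        vector[index] = 1.0
    return vector


# Training rows use the Project column from the question's table.
train_projects = ["A", "A", "B", "C"]
project_vocab = fit_vocabulary(train_projects)

# The vocabulary is built once, then reused unchanged at prediction time,
# so every vector has the same length and the same column meaning.
train_encoded = [one_hot(p, project_vocab) for p in train_projects]
predict_encoded = [one_hot(p, project_vocab) for p in ["B", "D"]]  # "D" is unseen
```

Mapping an unseen project to all zeros is just one possible choice; you could also reserve a dedicated "unknown" column. Either way, the point is that the encoding is not refit on the prediction data.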

Also make sure that the relationship makes sense. In some cases it leads to results that look more accurate but are easy to interpret the wrong way, or to the model memorizing answers instead of generalizing. It's really hard to give decent advice here without knowing more about the problem.

As for the number of examples, it really depends on the complexity of the problem. Even with a single input, if what you are trying to learn is a constant function, one example is enough; but if you are trying to learn a hash function, you are going to need a lot more, and even then it might not work or might make mistakes. My suggestion is to train with what you have, check how the loss progresses, and extrapolate from there.
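To make "check how the loss progresses" concrete, here is a rough sketch of a learning curve: train on growing subsets and record the loss on held-out data. It assumes synthetic data and swaps in a closed-form straight-line fit for the neural network so that it runs without any dependencies; `fit_line`, `mse` and `learning_curve` are hypothetical names, and with Keras you would call your own training and evaluation code inside the loop instead.

```python
import random


def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b (closed form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        # A single point (or identical inputs): fall back to a constant.
        return 0.0, mean_y
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    return a, mean_y - a * mean_x


def mse(model, xs, ys):
    """Mean squared error of the fitted line on the given data."""
    a, b = model
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def learning_curve(train_x, train_y, val_x, val_y, sizes):
    """Fit on the first `size` training examples and score on validation data."""
    curve = []
    for size in sizes:
        model = fit_line(train_x[:size], train_y[:size])
        curve.append((size, mse(model, val_x, val_y)))
    return curve


# Synthetic data (an assumption for this sketch): a noisy straight line.
rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(250)]
ys = [3 * x + 1 + rng.gauss(0, 0.5) for x in xs]

# First 200 points for training, last 50 held out for validation.
curve = learning_curve(xs[:200], ys[:200], xs[200:], ys[200:],
                       sizes=[5, 20, 80, 200])
for size, loss in curve:
    print(size, round(loss, 3))
```

If the validation loss is still dropping noticeably at the largest size, more data will probably help; if it has flattened out, extra examples are unlikely to change much and the model or the features are the more likely limit.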

Upvotes: 1
