Reputation: 191
I have a dataset like
Project  | Area       | Feature 1 | Feature 2 |
---------+------------+-----------+-----------+...
A        | Production | X         | X         |
A        | Testing    | Y         | Y         |
B        | Testing    | Z         | Z         |
C        | QA         | W         | W         |
Here "Area" depends on Project (i.e. the combination of Area and Project makes up the identity of an Area), and the two have a many-to-many relationship. I'm predicting Area with a deep neural network in Keras. How should I preprocess this data?
Project is a very important feature.
Also, is there any formula for approximating the amount of training data required for X features?
Upvotes: 0
Views: 2355
Reputation: 11968
Having related features is not in itself a problem. The problems usually show up when the input features you have at prediction time are not the same as the ones you had during training.
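A minimal sketch of one way to keep the inputs consistent, assuming Project is one-hot encoded (the question does not say how it is encoded). The helper names `fit_vocab` and `one_hot` are made up for illustration. The vocabulary is built once from the training rows and reused unchanged at prediction time, with slot 0 reserved for projects that were never seen in training:

```python
def fit_vocab(values):
    """Map each distinct training value to a column index; index 0 is kept for unseen values."""
    vocab = {}
    for v in values:
        if v not in vocab:
            vocab[v] = len(vocab) + 1
    return vocab


def one_hot(value, vocab):
    """Encode a value with a vocabulary fitted earlier; unknown values land in slot 0."""
    vec = [0] * (len(vocab) + 1)
    vec[vocab.get(value, 0)] = 1
    return vec


# (Project, Area) pairs taken from the table in the question.
train_rows = [("A", "Production"), ("A", "Testing"), ("B", "Testing"), ("C", "QA")]

# Fit on training data only.
project_vocab = fit_vocab(project for project, _ in train_rows)
area_ids = {area: i for i, area in enumerate(sorted({a for _, a in train_rows}))}

X_train = [one_hot(project, project_vocab) for project, _ in train_rows]
y_train = [area_ids[area] for _, area in train_rows]

# At prediction time, reuse the same vocabulary so the vector layout is identical.
x_known = one_hot("B", project_vocab)
x_unseen = one_hot("D", project_vocab)  # project "D" never appeared in training
```

The resulting lists have a fixed width, so they can be converted to arrays and fed to a model. The other feature columns would need the same treatment, each with its own vocabulary.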
Also, make sure that the relation makes sense. In some cases it can lead to results that look more accurate but that you interpret the wrong way, or to the model simply memorizing the training data. It's really hard to give decent advice here without knowing more about the problem.
As for the number of examples, it really depends on the complexity of the problem. Even for a single input, if what you are trying to learn is a constant function, one example is enough, but if you are trying to learn a hash function you are going to need a lot more, and even then the model might not work or might make mistakes. My suggestion is to train with what you have, check how the loss progresses, and extrapolate from there.
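To illustrate the constant-versus-hash point, here is a toy sketch rather than a Keras model. It assumes a deliberately simple learner (memorize the training pairs and fall back to the majority label for unseen inputs), and measures held-out accuracy as the training set grows. All names are illustrative:

```python
import hashlib
from collections import Counter


def constant_target(x):
    """A trivially learnable target: always the same label."""
    return 0


def hash_target(x):
    """A hash-like target: the label is the parity of the first SHA-256 byte."""
    return hashlib.sha256(str(x).encode()).digest()[0] % 2


def fit_predictor(xs, ys):
    """Memorize training pairs; predict the majority training label for unseen inputs."""
    table = dict(zip(xs, ys))
    majority = Counter(ys).most_common(1)[0][0]
    return lambda x: table.get(x, majority)


def held_out_accuracy(target, n_train, test_xs):
    """Train on inputs 0..n_train-1 and score on inputs that were not in training."""
    train_xs = list(range(n_train))
    predict = fit_predictor(train_xs, [target(x) for x in train_xs])
    hits = sum(predict(x) == target(x) for x in test_xs)
    return hits / len(test_xs)


test_xs = list(range(10_000, 11_000))
for n in (1, 10, 100, 1000):
    print(n,
          held_out_accuracy(constant_target, n, test_xs),
          held_out_accuracy(hash_target, n, test_xs))
```

The constant target is solved from a single example, while the hash-like target stays near chance however many examples are added. Plotting the same kind of curve for your own model (validation loss against training set size) is one way to see whether more data is likely to help.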
Upvotes: 1