Reputation: 673
Usually, when using Keras, the datasets used to train the neural network are labeled.
For example, if I have 100,000 rows of patients with 12 fields per row, the last field indicates whether the patient is diabetic or not (0 or 1). After training is finished, I can insert a new record and predict whether that person is diabetic or not.
But in the case of unlabeled datasets, where I cannot label the data for some reason, how can I train the neural network so that it learns that these are the normal records, and that any new record that does not match them is malicious or not accepted?
Upvotes: 1
Views: 2187
Reputation: 7148
This is called one-class learning and is usually done with autoencoders. You train an autoencoder on the training data to reconstruct the data itself, so the labels in this case are the inputs themselves. Comparing each reconstruction with its input gives you a reconstruction error. See https://en.wikipedia.org/wiki/Autoencoder
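A minimal sketch of that training setup in Keras might look like the following. The layer sizes, the number of epochs, the helper name `build_autoencoder`, and the synthetic `X_normal` array (standing in for the real 12-field records) are all assumptions made for illustration, not tuned values.

```python
import numpy as np
from tensorflow import keras


def build_autoencoder(n_features, bottleneck=4):
    """Small dense autoencoder; the layer sizes are illustrative guesses."""
    inputs = keras.Input(shape=(n_features,))
    hidden = keras.layers.Dense(8, activation="relu")(inputs)
    code = keras.layers.Dense(bottleneck, activation="relu")(hidden)
    hidden = keras.layers.Dense(8, activation="relu")(code)
    outputs = keras.layers.Dense(n_features, activation="linear")(hidden)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mse")
    return model


# Synthetic stand-in for the unlabeled "normal" records (assumption).
rng = np.random.default_rng(0)
X_normal = rng.normal(size=(1000, 12)).astype("float32")

autoencoder = build_autoencoder(X_normal.shape[1])
# The input is also the target: the network learns to reconstruct its input.
autoencoder.fit(X_normal, X_normal, epochs=5, batch_size=64, verbose=0)

# Per-record reconstruction error (mean squared error over the features).
reconstructed = autoencoder.predict(X_normal, verbose=0)
train_errors = np.mean((X_normal - reconstructed) ** 2, axis=1)
```

In practice the features would normally be scaled first, and the architecture would need tuning on the real data.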
Now you can define a threshold on the reconstruction error that separates benign data from the rest. The hope is that the reconstruction of the good data is better than the reconstruction of the bad data.
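One simple way to pick such a threshold is to take a high percentile of the reconstruction errors seen on the training data. The sketch below uses only NumPy; the function names, the default percentile, and the toy error values are assumptions for illustration, and the right cut-off depends on how many false alarms you can tolerate.

```python
import numpy as np


def reconstruction_error(x, x_hat):
    """Mean squared error per record between inputs and reconstructions."""
    return np.mean((x - x_hat) ** 2, axis=1)


def fit_threshold(train_errors, percentile=99.0):
    """Use a high percentile of the training errors as the cut-off."""
    return float(np.percentile(train_errors, percentile))


def is_anomalous(errors, threshold):
    """Flag records whose reconstruction error exceeds the threshold."""
    return errors > threshold


# Toy numbers (assumption): errors observed on the normal training records.
train_errors = np.array([0.10, 0.12, 0.09, 0.11, 0.13])
threshold = fit_threshold(train_errors, percentile=100.0)  # max training error

# Errors for two new records: one reconstructs well, one does not.
new_errors = np.array([0.08, 0.50])
flags = is_anomalous(new_errors, threshold)  # second record is flagged
```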
Edit to answer the question about the difference in performance between supervised and unsupervised learning.
This cannot be said with any certainty, because I have not tried it and I do not know what the final accuracy is going to be. As a rough estimate, supervised learning will perform better on the training data, because more information is supplied to the algorithm. However, if the actual data is quite different from the training data, the supervised network will underperform in practice, while the autoencoder tends to deal better with different data. Additionally, as a rule of thumb you should have about 5000 examples per class to train a neural network reliably, so labeling could take some time. But you will need some data to test with anyway.
Upvotes: 1
Reputation: 2135
It sounds like you need to fit two different models: one to detect bad (malicious) records and one to predict whether a patient is diabetic.
For both of these models, you will need to have labels. For the first model your labels would indicate whether the record is good or bad (malicious) and the second would be whether the patient is diabetic or not.
In order to detect bad records, you may find that simple logistic regression or SVM performs adequately.
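As a rough sketch of the record-quality model, assuming labeled good/bad records are available, a scikit-learn logistic regression could be fitted like this. The two synthetic clusters are made up for illustration and are far easier to separate than real data would be.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic labeled records (assumption): 0 = good, 1 = bad (malicious).
rng = np.random.default_rng(0)
good = rng.normal(loc=0.0, scale=1.0, size=(500, 12))
bad = rng.normal(loc=4.0, scale=1.0, size=(500, 12))
X = np.vstack([good, bad])
y = np.concatenate([np.zeros(500, dtype=int), np.ones(500, dtype=int)])

# Hold out part of the data to check how well the model generalizes.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)  # fraction of held-out records classified correctly
```

Swapping in `sklearn.svm.SVC` for `LogisticRegression` would give the SVM variant with the same `fit` and `score` calls.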
Upvotes: 0