Reputation: 85
I have the following problem: I have a dataset (ARFF) that stores three attributes: character, key holdtime, user. With this information, I have to calculate the probability that a person typing on the keyboard is a particular stored user.
When a person types on the keyboard, the same information as above is extracted (character, key holdtime, user) and "compared" with the ARFF file. The result should be as follows: I have a dataset for user "John" in the ARFF file. Later, a user enters the username "John" and writes some text. The result should be the probability that this user's typing matches the dataset of "John" stored in the ARFF file, for example: with 90% probability it is the right person, i.e. it is John.
I hope I could explain my problem. My question is: which classifier should I use in this case? I tried IBk, but with 15 persons the probability is divided among 15 and I get small probabilities. The probability depends on the number of persons stored in the ARFF file. Or should I multiply the result by the number of persons to get the real probability?
Upvotes: 1
Views: 174
Reputation: 146
Note: the sum of all the probabilities of a distribution has to be 1.
It is somewhat true that you get "small probabilities" when you have more classes, but it's NOT because the probability is divided by the number of classes, so you won't find the probability you want by multiplying the result by the number of classes: the result is not a probability anymore (it could easily become >1).
The probability distribution that you obtained using IBk is different from what you wanted: it tells you which of the stored users is most similar to the current user (probability of being John vs. being Paul vs. being Sarah, etc.), independently of the name he gave.
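To make the point about multiplying concrete, here is a small arithmetic sketch. The numbers are invented for illustration (they are not output from IBk): a 15-class distribution in which John gets 0.40 and the remaining mass is spread evenly over the other users.

```python
# Hypothetical class distribution over 15 stored users (made-up values).
n_users = 15
p_john = 0.40
p_other = (1.0 - p_john) / (n_users - 1)

distribution = [p_john] + [p_other] * (n_users - 1)

# A valid distribution sums to 1 (up to floating-point rounding).
total = sum(distribution)

# Multiplying John's share by the number of classes is no longer a probability.
scaled_john = p_john * n_users

print(f"sum of distribution: {total:.6f}")
print(f"p_john * n_users:    {scaled_john:.2f}")
```

The scaled value is far greater than 1, so it cannot be read as "the probability that this is John".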
The output you want is the result of a binary classifier, but you'll need to train a classifier for every user you stored.
The training set of each classifier will be similar to the dataset you already have, but (in the case of John) there will be an isJohn column instead of user, and this new column will contain true if user was John and false otherwise.
EDIT
character, key holdtime, user
90, 150ms, John
70, 120ms, Sarah
100, 110ms, Paul
will become
character, key holdtime, isJohn
90, 150ms, true
70, 120ms, false
100, 110ms, false
The output distribution of this classifier is "is John" vs. "is not John".
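The relabeling step can be sketched in a few lines of Python. This is only an illustration on in-memory tuples, not Weka code; the function name `relabel` and the tuple layout are assumptions made for the example, and hold times are written as plain millisecond integers.

```python
# Rows as (character, key holdtime in ms, user), taken from the example table.
rows = [
    (90, 150, "John"),
    (70, 120, "Sarah"),
    (100, 110, "Paul"),
]

def relabel(rows, target_user):
    """Replace the user column with a boolean: does the row belong to target_user?"""
    return [(char, hold_ms, user == target_user) for char, hold_ms, user in rows]

# Training set for the "is John" classifier.
is_john_rows = relabel(rows, "John")

for row in is_john_rows:
    print(row)
```

Calling `relabel(rows, "Sarah")` would give the training set for Sarah's classifier in the same way.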
To have the exact output you want, you must train a classifier for each stored user and call the right one depending on the name the current user said.
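A minimal sketch of the "one binary classifier per user" idea follows. It uses a tiny hand-written k-nearest-neighbour vote as a stand-in for IBk, not Weka's actual implementation; the helper names (`knn_probability`, `build_verifiers`, `verify`), the sample data, and the plain Euclidean distance over (character, holdtime) are all assumptions chosen to keep the example short.

```python
from math import hypot

def knn_probability(train, sample, k=3):
    """Fraction of the k nearest training rows whose label is True."""
    nearest = sorted(
        train,
        key=lambda r: hypot(r[0] - sample[0], r[1] - sample[1]),
    )[:k]
    return sum(1 for r in nearest if r[2]) / len(nearest)

def build_verifiers(rows):
    """One relabeled (is-this-user) training set per stored user."""
    users = {user for _, _, user in rows}
    return {u: [(c, h, user == u) for c, h, user in rows] for u in users}

def verify(verifiers, claimed_user, sample, k=3):
    """Ask only the classifier that belongs to the claimed user name."""
    return knn_probability(verifiers[claimed_user], sample, k)

# Made-up rows: (character, key holdtime in ms, user).
rows = [
    (90, 150, "John"), (90, 148, "John"), (70, 152, "John"),
    (90, 120, "Sarah"), (70, 118, "Sarah"),
    (90, 110, "Paul"), (70, 108, "Paul"),
]

verifiers = build_verifiers(rows)
sample = (90, 149)  # one observed keystroke from the current user

p_john = verify(verifiers, "John", sample)    # user claims to be John
p_sarah = verify(verifiers, "Sarah", sample)  # same typing, claims to be Sarah
print(p_john, p_sarah)
```

Because each classifier only answers "is this user" vs. "is not this user", its output no longer shrinks as more users are stored. In practice you would aggregate over many keystrokes and scale the features rather than score a single raw sample.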
As for which classifier to use, I don't think there is a way to know in advance which is best for your case. I usually try several classifiers and choose the best one.
Upvotes: 1