Reputation: 188
We recently studied the Naïve Bayesian Classifier in our Machine Learning class, and I'm now trying to implement it on the Fisher Iris dataset as a self-exercise. The concept is easy and straightforward, with some trickiness involved for continuous attributes. Several sources I read recommended using a Gaussian approximation to compute the probability of test data values, so that is what my code does.
I'm initially running it with 50% of the samples for training and 50% for testing, but something is wrong. The current code always predicts class 1 (I used integers to represent the classes) for all test samples, which is obviously incorrect.
My guess is that the problem is caused by the code omitting normalization. However, I think adding normalization would still yield proportionate results, and so far my attempts to normalize have produced the same classification results.
Can someone please point out whether anything obvious is missing here, or whether I'm approaching this the wrong way? Since most of the code is 'mechanics', I have marked with asterisks (****************) the two statements that are responsible for the calculations. Any help is appreciated, thanks!
% data is assumed to be a 150x5 matrix: columns 1-4 are attributes, column 5 is the class (1-3)
nsamples=75; % 50% samples
% acquire training set and test set
[trainingSample,idx] = datasample(data,nsamples,'Replace',false);
testData = data(setdiff(1:150,idx),:);
% define Gaussian function
%***********************************************************%
Phi=@(mu,sig2,x) (1/sqrt(2*pi*sig2))*exp(-((x-mu)^2)/2*sig2);
%***********************************************************%
for c=1:3 % for 3 classes in training set
    clear y x mu sig2;
    index=1;
    for i=1 : length(trainingSample)
        if trainingSample(i,5)==c
            y(index,:)=trainingSample(i,:); % filter current class samples
            index=index+1; % for conditional probabilities
        end
    end
    for j=1:size(testData,1) % iterate over test samples
        clear pf p;
        for i=1:4 % iterate over columns
            x=testData(j,i); % representing attributes
            mu=mean(y(:,i));
            sig2=var(y(:,i));
            pf(i) = Phi(mu,sig2,x); % calc conditional probability
        end
        % calc unnormalized class posterior; prior * likelihood
        %*****************************************************%
        pc(j,c) = size(y,1)/nsamples * pf(1)*pf(2)*pf(3)*pf(4);
        %*****************************************************%
    end
end
% find the predicted class for each test sample
% by taking the max probability calculated
for i=1:size(pc,1)
    [~,q]=max(pc(i,:));
    predicted(i)=q;
    actual(i)=testData(i,5);
end
Upvotes: 2
Views: 11157
Reputation: 2272
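Normalization shouldn't be necessary, since the classifier only compares the per-class scores with each other.
p(class|thing) ∝ p(class) p(thing|class)
= p(class) p(feature_1|class) p(feature_2|class) ... p(feature_N|class)
(the second line uses the naive assumption that the features are independent given the class).
So when you fit the parameters of the distribution feature_i|class on normalized data, the parameters (in this case mu and sigma2) are simply rescaled to the new "scale". Each density value changes by the same constant factor for every class, so the ranking of the classes, and therefore the prediction, remains the same.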
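The scale argument above can be checked numerically. The following is a minimal sketch in Python (the language suggested further down), using only the standard library. The data values, the `scale` factor and the helper names `gaussian_pdf` and `class_scores` are made up for illustration and are not taken from the question.

```python
import math
import statistics

def gaussian_pdf(x, mu, var):
    """Normal density with mean mu and variance var."""
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def class_scores(train, x):
    """train maps class label -> list of values of one feature.
    Returns prior * likelihood for each class."""
    n_total = sum(len(values) for values in train.values())
    scores = {}
    for label, values in train.items():
        mu = statistics.mean(values)
        var = statistics.variance(values)  # sample variance (divides by n - 1)
        scores[label] = len(values) / n_total * gaussian_pdf(x, mu, var)
    return scores

# Made-up single-feature data for two classes (not the Iris values).
train = {1: [4.9, 5.1, 5.0, 5.4], 2: [6.3, 6.6, 6.1, 6.9]}
x = 5.8

scale = 10.0  # hypothetical change of units, e.g. centimetres to millimetres
scaled_train = {c: [scale * v for v in vals] for c, vals in train.items()}

raw = class_scores(train, x)
scaled = class_scores(scaled_train, scale * x)

# Every class score shrinks by the same factor (1 / scale),
# so the class with the largest score does not change.
ratios = {c: raw[c] / scaled[c] for c in raw}
predicted_raw = max(raw, key=raw.get)
predicted_scaled = max(scaled, key=scaled.get)
print(ratios, predicted_raw, predicted_scaled)
```

The ratio between the unscaled and scaled score equals `scale` for every class, which is why rescaling a feature cannot change which class wins.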
The MATLAB code is hard to read because of all the indexing and the manual splitting into training and test sets, which is a possible source of problems. You should try something with much less unnecessary code around the model (I would recommend Python with scikit-learn, for example, which has a lot of helpers for splitting data and the like: http://scikit-learn.org/).
It's really important that you separate the training and test data, and only train the model with training data and test the trained model with the test data. (Is this done?)
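As a rough illustration of that separation, here is a small standard-library Python sketch in which the parameters are estimated from the training half only and accuracy is measured on the held-out half. The data is synthetic, and `fit`, `predict` and `log_gaussian` are hypothetical helper names. It also sums log-densities instead of multiplying densities, which is a variation on the code in the question that avoids numerical underflow.

```python
import math
import random
import statistics

def fit(rows):
    """Estimate prior, per-feature mean and variance for each class.
    rows is a list of (features, label) pairs from the TRAINING set only."""
    by_class = {}
    for features, label in rows:
        by_class.setdefault(label, []).append(features)
    model = {}
    for label, feats in by_class.items():
        columns = list(zip(*feats))  # one tuple of values per feature
        model[label] = {
            "prior": len(feats) / len(rows),
            "mu": [statistics.mean(col) for col in columns],
            "var": [statistics.variance(col) for col in columns],
        }
    return model

def log_gaussian(x, mu, var):
    """Logarithm of the normal density."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

def predict(model, features):
    """Return the class with the largest log(prior) + sum of log-likelihoods."""
    def log_score(label):
        p = model[label]
        return math.log(p["prior"]) + sum(
            log_gaussian(x, m, v) for x, m, v in zip(features, p["mu"], p["var"])
        )
    return max(model, key=log_score)

rng = random.Random(0)
# Synthetic, well-separated two-class data standing in for a real dataset.
data = [([rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)], 1) for _ in range(50)]
data += [([rng.gauss(5.0, 1.0), rng.gauss(5.0, 1.0)], 2) for _ in range(50)]

rng.shuffle(data)
half = len(data) // 2
train_rows, test_rows = data[:half], data[half:]  # disjoint by construction

model = fit(train_rows)  # the test rows are never seen here
correct = sum(predict(model, f) == label for f, label in test_rows)
accuracy = correct / len(test_rows)
print(f"test accuracy: {accuracy:.2f}")
```

Because the two slices of the shuffled list cannot overlap, no test sample can leak into the parameter estimates.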
The next step is to check the parameters. The easiest ways to do this are either to print them out (a sanity check) or, for each feature, to plot the fitted Gaussian bell curves next to a histogram of the data to see that they match (remember that each histogram bar must have height number_of_samples_within_range/total_number_of_samples, further divided by the bin width if the bins are not one unit wide, so that the bars cover a total area of 1 like the density).
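One sanity check of this kind is to confirm that the density function integrates to roughly 1 for the fitted parameters. The sketch below does this in standard-library Python for two expressions: the textbook normal density, and a transcription of the `Phi` line from the question with its operator order kept as written. The reading of that line is my own observation rather than something stated in the thread, and the parameter values are hypothetical.

```python
import math

def phi_textbook(mu, sig2, x):
    # Normal density: the variance is in the denominator of the exponent.
    return math.exp(-((x - mu) ** 2) / (2 * sig2)) / math.sqrt(2 * math.pi * sig2)

def phi_as_written(mu, sig2, x):
    # Same operator order as the MATLAB line in the question:
    # "/2*sig2" divides by 2 and then multiplies by sig2.
    return (1 / math.sqrt(2 * math.pi * sig2)) * math.exp(-((x - mu) ** 2) / 2 * sig2)

def integrate(f, lo, hi, steps=20000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    width = (hi - lo) / steps
    return sum(f(lo + (k + 0.5) * width) for k in range(steps)) * width

mu, sig2 = 5.0, 0.25  # hypothetical fitted parameters for one feature of one class

area_textbook = integrate(lambda x: phi_textbook(mu, sig2, x), mu - 20, mu + 20)
area_as_written = integrate(lambda x: phi_as_written(mu, sig2, x), mu - 20, mu + 20)
print(area_textbook, area_as_written)
```

With these values the textbook form integrates to about 1, while the as-written form integrates to about 1/sig2, which suggests the exponent in the question may need parentheses around `2*sig2`.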
Visualising the data and the model is really important to know what is happening.
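Without a plotting library, a rough text version of the histogram comparison can still be produced. This is a minimal standard-library Python sketch on synthetic values standing in for one feature of one class. The bin count and the bar scaling are arbitrary choices. The bars are normalized by both the sample count and the bin width so that they are directly comparable with the fitted density.

```python
import math
import random
import statistics

def gaussian_pdf(x, mu, var):
    """Normal density with mean mu and variance var."""
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

rng = random.Random(1)
# Synthetic values standing in for one feature of one class.
values = [rng.gauss(5.0, 0.5) for _ in range(2000)]

mu = statistics.mean(values)
var = statistics.variance(values)

n_bins = 10
lo, hi = min(values), max(values)
bin_width = (hi - lo) / n_bins
counts = [0] * n_bins
for v in values:
    k = min(int((v - lo) / bin_width), n_bins - 1)  # the maximum goes in the last bin
    counts[k] += 1

# Bar height = fraction of samples in the bin, divided by the bin width,
# so the bars cover a total area of 1 like the density does.
heights = [c / (len(values) * bin_width) for c in counts]
for k, h in enumerate(heights):
    centre = lo + (k + 0.5) * bin_width
    fitted = gaussian_pdf(centre, mu, var)
    print(f"{centre:6.2f}  hist={h:5.3f}  fitted={fitted:5.3f}  " + "#" * round(40 * h))

total_area = sum(h * bin_width for h in heights)
```

If the `hist` and `fitted` columns disagree badly for some feature, either the parameters or the density function is a likely culprit.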
Upvotes: 3