shasan

Reputation: 188

Naïve Bayes Classifier -- is normalization necessary?

We recently studied the Naïve Bayesian Classifier in our Machine Learning class and now I'm trying to implement it on the Fisher Iris dataset as a self-exercise. The concept is easy and straightforward, with some trickiness involved for continuous attributes. Several literature resources I read recommend using a Gaussian approximation to compute the probability of continuous attribute values, so that's what I'm using in my code.

I'm initially running it with 50% training and 50% test samples, but something is off: the current code always predicts class 1 (I used integers to represent the classes) for every test sample, which is obviously wrong.

My guess was that the problem might be the missing normalization, though I'd expect normalization to only rescale the results proportionately, and indeed my attempts at normalizing so far have produced the same classifications.

Can someone suggest whether there is anything obvious missing here, or whether I'm not approaching this right? Since most of the code is 'mechanics', I have highlighted (****************) the two lines responsible for the calculations. Any help is appreciated, thanks!

nsamples=75;                                      % 50% samples
% acquire training set and test set
[trainingSample,idx] = datasample(data,nsamples,'Replace',false);
testData = data(setdiff(1:150,idx),:);

% define Gaussian function
%***********************************************************%
Phi=@(mu,sig2,x) (1/sqrt(2*pi*sig2))*exp(-((x-mu)^2)/2*sig2);
%***********************************************************%
   
for c=1:3                                         % for 3 classes in training set
    clear y x mu sig2;
    index=1;
    for i=1 : length(trainingSample)
        if trainingSample(i,5)==c
            y(index,:)=trainingSample(i,:);       % filter current class samples
            index=index+1;                        % for conditional probabilities
        end
    end
    
    for j=1:size(testData,1)                      % iterate over test samples
        clear pf p;
        for i=1:4                                 % iterate over columns
            x=testData(j,i);                      % representing attributes
            mu=mean(y(:,i));
            sig2=var(y(:,i));
            pf(i) = Phi(mu,sig2,x);               % calc conditional probability
        end
        
        % calc class likelihood; prior * posterior
        %*****************************************************%
        pc(j,c) = size(y,1)/nsamples * pf(1)*pf(2)*pf(3)*pf(4);
        %*****************************************************%
    end
end

% find the predicted class for each test sample
% by taking the max probability calculated    
for i=1:size(pc,1)
    [~,q]=max(pc(i,:));
    predicted(i)=q;
    actual(i)=testData(i,5);
end

Upvotes: 2

Views: 11157

Answers (1)

SlimJim

Reputation: 2272

Normalization shouldn't be necessary, since in the end the per-class scores for a given sample are only compared with each other.

p(class|thing) ∝ p(class) p(thing|class)
             = p(class) p(feature_1|class) p(feature_2|class) ... p(feature_N|class)

(Proportional, because the common denominator p(thing) is the same for every class and doesn't affect which class wins; the second step is the naive independence assumption.)

So when you fit the parameters of each feature_i|class distribution on rescaled data, the fitted parameters (here mu and sigma2) simply rescale along with the data; every class's score then picks up the same constant factor, and the predicted class stays the same.
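A quick way to convince yourself (a minimal sketch with made-up numbers; normpdf is from the Statistics Toolbox, which you already seem to have since you use datasample):

    % Made-up numbers for one feature and three classes.
    x   = 5.1;                 % raw feature value of a test sample
    mu  = [5.0 5.9 6.6];       % per-class means
    sig = [0.35 0.52 0.64];    % per-class standard deviations
    m = 5.8; s = 0.83;         % overall mean/std used for standardization

    rawLik = normpdf(x, mu, sig);                  % likelihoods on the raw scale
    stdLik = normpdf((x-m)/s, (mu-m)/s, sig/s);    % likelihoods after standardizing

    disp(rawLik ./ stdLik)     % every ratio is the same constant for all classes
    [~, c1] = max(rawLik);
    [~, c2] = max(stdLik);     % c1 == c2: the predicted class is unchanged

All the likelihoods change by the same constant factor, so max picks the same class either way.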

It's hard to read the MATLAB code due to all the indexing and manual splitting of training/test data etc., which is itself a possible source of bugs. You should try something with a lot less unnecessary stuff around it (I would recommend Python with scikit-learn, for example; it has a lot of helpers for splitting data and such: http://scikit-learn.org/).

It's really important that you separate the training and test data, train the model only on the training data, and evaluate the trained model only on the test data. (Is this done?)
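If you have the Statistics Toolbox (you already use datasample from it), cvpartition is one way to make that split explicit; just a sketch using the variable names from the question:

    cv             = cvpartition(size(data,1), 'HoldOut', 0.5);   % 50/50 hold-out split
    trainingSample = data(training(cv), :);                       % fit mu/sig2 from these rows only
    testData       = data(test(cv), :);                           % score these rows only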

The next step is to check the fitted parameters, which is most easily done by either printing them out (sanity check) or:

for each feature, rendering the fitted Gaussian bell next to a histogram of the data to see that they match (remember that each histogram bar must have height number_of_samples_within_bin/total_number_of_samples).
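In MATLAB that could look something like this (a rough sketch assuming y(:,i) holds feature i of the current class, as in the question's loop):

    xi   = y(:,i);                            % one feature, one class
    mu   = mean(xi);
    sig2 = var(xi);

    [counts, centers] = hist(xi, 10);         % 10 bins
    binWidth = centers(2) - centers(1);
    bar(centers, counts / numel(xi), 1);      % bar height = samples_in_bin / total_samples
    hold on;
    xs = linspace(min(xi), max(xi), 200);
    plot(xs, normpdf(xs, mu, sqrt(sig2)) * binWidth, 'r', 'LineWidth', 2);   % pdf scaled to bin heights
    hold off;

If the red curve sits nowhere near the bars, the fitted (mu, sigma2) for that feature/class pair is suspect.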

Visualising the data and the model is really important to know what is happening.

Upvotes: 3
