user846400

Reputation: 1101

Online learning with Naive Bayes Classifier

I am trying to predict the inter-arrival time of incoming network packets. I measure the inter-arrival times and represent them as binary features: xi=0,1,1,1,0,..., where xi=0 if the inter-arrival time is less than a break-even time and 1 otherwise. The data has to be mapped into two possible classes C={0,1}, where C=0 represents a short inter-arrival time and C=1 represents a long one. I want to implement the classifier in an online fashion: as soon as I observe a feature vector xi=0,1,1,0,..., I calculate the MAP class. Since I don't have prior estimates of the conditional and prior probabilities, I initialize them as follows:

p(x=0|c=0)=p(x=1|c=0)=p(x=0|c=1)=p(x=1|c=1)=0.5
p(c=0)=p(c=1)=0.5

For each feature vector (x1=m1,x2=m2,...,xn=mn), when I output a class C, I update the conditional and prior probabilities as follows:

p(xi=mi|y=c)=a+(1-a)*p(xi=mi|y=c)
p(y=c)=b+(1-b)*p(y=c)
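
A minimal sketch of one step of this update, for concreteness. The step size `a = 0.1` and the variable names are hypothetical, and setting the complementary probability to one minus the updated value is an assumption that the rule above does not state:

```python
def update(p, step):
    """Move probability p toward 1: p <- step + (1 - step) * p."""
    return step + (1 - step) * p

a = 0.1                      # hypothetical step size for the conditionals

p_x1_given_c0 = 0.5          # p(x=1|c=0), initialized as above
p_x1_given_c0 = update(p_x1_given_c0, a)   # observed x=1 and output class 0

# Assumption: keep the pair normalized by deriving the complement.
p_x0_given_c0 = 1 - p_x1_given_c0
```

Without some form of renormalization, repeated updates would push both p(x=0|c) and p(x=1|c) toward 1, so they would no longer sum to 1.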

The problem is that I always get a biased prediction. Since long inter-arrival times are much less frequent than short ones, the posterior of the short class always remains higher than that of the long class. Is there any way to improve this, or am I doing something wrong? Any help will be appreciated.

Upvotes: 0

Views: 4196

Answers (1)

etov

Reputation: 3032

Since you have a long time series, the best path would probably be to take into account more than a single previous value. The standard way of doing this is to use a time window, i.e. split the long vector xi into overlapping pieces of a constant length, with the last value of each piece treated as the class, and use these pieces as the training set. This can also be done on streaming data in an online manner, by incrementally updating the NB model with new data as it arrives.
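
The windowing idea can be sketched in plain Python roughly as follows. This is only an illustration under assumptions: the class and function names are made up, the window length of 3 is arbitrary, and the model keeps simple counts with Laplace smoothing rather than the exponential update from the question.

```python
import math
from collections import deque


class OnlineBernoulliNB:
    """Count-based Bernoulli Naive Bayes for two classes, Laplace-smoothed."""

    def __init__(self, n_features, alpha=1.0):
        self.alpha = alpha
        self.class_count = [0, 0]
        # ones_count[c][i] = how often feature i was 1 when the class was c
        self.ones_count = [[0] * n_features for _ in range(2)]

    def update(self, x, c):
        """Fold one labeled window into the counts."""
        self.class_count[c] += 1
        for i, xi in enumerate(x):
            self.ones_count[c][i] += xi

    def predict(self, x):
        """Return the MAP class for the binary feature vector x."""
        total = sum(self.class_count)
        scores = []
        for c in (0, 1):
            n_c = self.class_count[c]
            # Smoothed log prior
            score = math.log((n_c + self.alpha) / (total + 2 * self.alpha))
            for i, xi in enumerate(x):
                # Smoothed estimate of p(x_i = 1 | c)
                p1 = (self.ones_count[c][i] + self.alpha) / (n_c + 2 * self.alpha)
                score += math.log(p1 if xi == 1 else 1.0 - p1)
            scores.append(score)
        return 0 if scores[0] >= scores[1] else 1


def run_stream(bits, window=3):
    """Predict each bit from the previous `window` bits, then learn from it."""
    model = OnlineBernoulliNB(window)
    recent = deque(maxlen=window)
    predictions = []
    for bit in bits:
        if len(recent) == window:
            x = list(recent)
            predictions.append(model.predict(x))
            model.update(x, bit)  # the true label is known once the packet arrives
        recent.append(bit)
    return predictions
```

For example, `run_stream([0, 0, 1] * 30)` returns one prediction for every bit after the first three. Each window is used first for prediction and only then for training, so the model never sees a label before predicting it.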

Note that with this method, other regression algorithms might end up being a better choice than NB.

Weka (version 3.7.3 and up) has a very nice dedicated tool supporting time-series analysis. Alternatively, MOA is also based on Weka and supports modeling of streaming data.

EDIT: It might also be a good idea to move from binary features to the real values (maybe normalized) and apply the threshold post-classification. This might give more information to the regression model (NB or other), allowing better accuracy.
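
As a rough sketch of the real-valued variant, one could predict the next inter-arrival time from the last few raw values and threshold only the prediction. The choices here are assumptions for illustration, not part of the suggestion itself: an online least-mean-squares (LMS) autoregressive predictor stands in for "the regression model", values are normalized by the break-even time, and the function name, window length, and learning rate are made up.

```python
from collections import deque


def predict_long_gaps(times, break_even, window=3, lr=0.05):
    """Predict 0/1 labels (1 = long gap) from raw inter-arrival times.

    The regression runs on normalized real values; the break-even
    threshold is applied only to the predicted value.
    """
    scale = break_even                  # assumption: normalize by the break-even time
    weights = [1.0 / window] * window   # start out as a plain moving average
    recent = deque(maxlen=window)
    labels = []
    for t in times:
        t_norm = t / scale
        if len(recent) == window:
            x = list(recent)
            y_hat = sum(w * xi for w, xi in zip(weights, x))
            # Threshold after prediction, on the original time scale
            labels.append(1 if y_hat * scale >= break_even else 0)
            # LMS step: nudge the weights along the prediction error
            err = t_norm - y_hat
            weights = [w + lr * err * xi for w, xi in zip(weights, x)]
        recent.append(t_norm)
    return labels
```

The learning rate has to stay small relative to the squared magnitude of the normalized inputs, otherwise the LMS updates can diverge. Normalizing by the break-even time keeps typical values near 1, which helps with that.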

Upvotes: 1
