Reputation: 31
In an online process consisting of several steps, I have data on people who complete the process and on people who drop out. For each user, the data consist of a sequence of process steps per time interval, say one second.
An example of such a sequence for a user who completed the process would be [1,1,1,1,2,2,2,3,3,3,3,...,-1],
where the user is in step 1 for four seconds, followed by step 2 for three seconds and step 3 for four seconds, and so on, before reaching the end of the process (denoted by -1).
An example of a dropout would be [1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2],
where the user spends an excessive amount of time in step 1, then five seconds in step 2, and then closes the webpage, never reaching the end (-1).
Based on a model, I would like to predict/classify online (i.e., in real time) the probability of the user completing the process or dropping out.
I have read about HMMs and would like to apply the following approach:
- Train one model using the sequences of people who completed the process.
- Train another model using the sequences of people who did not complete the process.
- Collect the stream of incoming data for an unseen user and, at each time step, use the forward algorithm on each of the two models to see which one is more likely to have produced the stream so far. The corresponding model then gives the label for this stream (see the sketch below).
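Here is a minimal sketch of this two-model scheme using hmmlearn's GaussianHMM; the toy sequences, state counts, and helper names (fit_hmm, classify) are purely illustrative:

import numpy as np
from hmmlearn import hmm

def fit_hmm(sequences, n_states):
    # hmmlearn takes one concatenated 2-D array plus per-sequence lengths
    X = np.concatenate(sequences).reshape(-1, 1).astype(float)
    lengths = [len(s) for s in sequences]
    model = hmm.GaussianHMM(n_components=n_states, n_iter=100)
    model.fit(X, lengths)
    return model

# toy training data for the two classes
completed = [[1,1,1,1,2,2,2,3,3,3,3,-1], [1,1,2,2,2,3,3,3,-1]]
dropped   = [[1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2], [1,1,1,1,1,1,2,2]]
m_complete = fit_hmm(completed, n_states=4)
m_dropout  = fit_hmm(dropped, n_states=3)

def classify(prefix):
    # score() runs the forward algorithm and returns the log-likelihood
    x = np.asarray(prefix, dtype=float).reshape(-1, 1)
    ll_c = m_complete.score(x)
    ll_d = m_dropout.score(x)
    return ('complete' if ll_c > ll_d else 'dropout'), ll_c - ll_d

print(classify([1,1,1,1,1,1]))  # called at each new time step on the stream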
What is your opinion? Is this doable? I have been looking at the Python libraries hmmlearn and pomegranate, but I cannot seem to create a small working example to test with. Some test code of mine, with artificial data, can be found below:
from pomegranate import *
import numpy as np
# generate data of some sample sequences of length 4
# mean and std of each step in sequence
means = [1,2,3,4]
stds = [0.1, 0.1, 0.1, 0.1]
num_data = 100
data = []
for mean, std in zip(means, stds):
    d = np.random.normal(mean, std, num_data)
    data.append(d)
data = np.array(data).T
# create model (based on sample code of pomegranate https://github.com/jmschrei/pomegranate/blob/master/tutorials/Tutorial_3_Hidden_Markov_Models.ipynb)
s1 = State( NormalDistribution( 1, 1 ), name="s1" )
s2 = State( NormalDistribution( 2, 1 ), name="s2" )
model = HiddenMarkovModel()
model.add_states( [s1, s2] )
model.add_transition( model.start, s1, 0.5, pseudocount=4.2 )
model.add_transition( model.start, s2, 0.5, pseudocount=1.3 )
model.add_transition( s1, s2, 0.5, pseudocount=5.2 )
model.add_transition( s2, s1, 0.5, pseudocount=0.9 )
model.bake()
#model.plot()
# fit model
model.fit( data, use_pseudocount=False, algorithm = 'baum-welch', verbose=False )
# get probability of very clean sequence (mean of each step)
p = model.probability([1,2,3,4])
print(p)  # 3.51e-112
I would expect the probability of this very clean sequence to be close to 1, since the values are the means of each step's distribution. How can I make this example better and eventually apply it to my application?
I am not sure what states and transitions my model should comprise. What is a 'good' model? How do you know when you need to add more states to make the model expressive enough for the data? The pomegranate tutorials are nice, but insufficient for me to apply HMMs in this context.
Upvotes: 3
Views: 3028
Reputation: 77827
Yes, the HMM is a viable way to do this, although it's a bit of overkill, since the underlying finite-state machine is a simple linear chain. A simpler "model" can also be built from the mean and variance at each position of the sequence, and you can simply compare the distance of the partial sequence to each class's parameters, re-checking at each desired time point (a sketch of this baseline follows).
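As a rough sketch of that baseline (the helper names, smoothing constant, and toy numbers below are purely illustrative):

import numpy as np

def per_position_stats(sequences, length):
    # stack sequences truncated to `length`; column-wise mean/std per position
    X = np.array([s[:length] for s in sequences if len(s) >= length], dtype=float)
    return X.mean(axis=0), X.std(axis=0) + 1e-3  # smooth to avoid division by zero

def distance(prefix, mean, std):
    # mean squared z-distance of the observed prefix to the class template
    t = len(prefix)
    z = (np.asarray(prefix, dtype=float) - mean[:t]) / std[:t]
    return float(np.mean(z ** 2))

completed = [[1,1,1,1,2,2,2,3,3,3], [1,1,2,2,2,2,3,3,3,3]]
dropped   = [[1,1,1,1,1,1,1,1,2,2], [1,1,1,1,1,2,2,2,2,2]]
mu_c, sd_c = per_position_stats(completed, 10)
mu_d, sd_d = per_position_stats(dropped, 10)

prefix = [1,1,1,1,1,1]  # the stream observed so far
label = 'complete' if distance(prefix, mu_c, sd_c) < distance(prefix, mu_d, sd_d) else 'dropout'
print(label)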
The states are simple enough:
1 ==> 2 ==> 3 ==> ... ==> done
Each state has a loop back to itself; this is the most frequent choice. There is also a transition to "failed" from any state.
Thus, the Markov matrix will be sparse, something like:

         1     2     3     4     done  failed
1        0.8   0.1   0     0     0     0.1
2        0     0.8   0.1   0     0     0.1
3        0     0     0.8   0.1   0     0.1
4        0     0     0     0.8   0.1   0.1
done     0     0     0     0     1.0   0
failed   0     0     0     0     0     1.0
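If you do go the HMM route, one way to encode this sparse left-to-right structure in hmmlearn is to fix the start and transition probabilities by hand and learn only the emission parameters; a sketch, using the illustrative values above:

import numpy as np
from hmmlearn import hmm

transmat = np.array([
    [0.8, 0.1, 0.0, 0.0, 0.0, 0.1],
    [0.0, 0.8, 0.1, 0.0, 0.0, 0.1],
    [0.0, 0.0, 0.8, 0.1, 0.0, 0.1],
    [0.0, 0.0, 0.0, 0.8, 0.1, 0.1],
    [0.0, 0.0, 0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
])

# update only means ('m') and covariances ('c'); startprob/transmat stay fixed
model = hmm.GaussianHMM(n_components=6, params='mc', init_params='mc')
model.startprob_ = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 0.0])
model.transmat_ = transmat
# model.fit(X, lengths) would then estimate only the per-state emissions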
Upvotes: 0