Reputation: 31
In an online process consisting of several steps, I have data on people who complete the process and on people who drop out. For each user, the data consist of a sequence of process steps per time interval, say one second.
An example of such a sequence for a user who completed the process would be [1,1,1,1,2,2,2,3,3,3,3,...,-1],
where the user is in step 1 for four seconds, followed by step 2 for three seconds and step 3 for four seconds, and so on, before reaching the end of the process (denoted by -1).
An example of a dropout would be [1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2],
where the user spends an excessive amount of time in step 1, then five seconds in step 2, and then closes the webpage, never reaching the end (-1).
Based on a model, I would like to predict/classify online (i.e., in real time) the probability of the user completing the process or dropping out.
I have read about HMMs and would like to apply the following approach:
- Train one model using the sequences of people who completed the process.
- Train another model using the sequences of people who did not complete the process.
- Collect the stream of incoming data for an unseen user and, at each time step, use the forward algorithm on each of the two models to see which one is more likely to have produced the stream so far. The corresponding model then gives the label for this stream (see the sketch below).
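Here is a minimal sketch of this two-model scheme using hmmlearn's GaussianHMM; the toy sequences, state counts, and helper names (fit_hmm, classify) are purely illustrative:

import numpy as np
from hmmlearn import hmm

def fit_hmm(sequences, n_states):
    # hmmlearn takes one concatenated 2-D array plus per-sequence lengths
    X = np.concatenate(sequences).reshape(-1, 1).astype(float)
    lengths = [len(s) for s in sequences]
    model = hmm.GaussianHMM(n_components=n_states, n_iter=100)
    model.fit(X, lengths)
    return model

# toy training data for the two classes
completed = [[1,1,1,1,2,2,2,3,3,3,3,-1], [1,1,2,2,2,3,3,3,-1]]
dropped   = [[1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2], [1,1,1,1,1,1,2,2]]
m_complete = fit_hmm(completed, n_states=4)
m_dropout  = fit_hmm(dropped, n_states=3)

def classify(prefix):
    # score() runs the forward algorithm and returns the log-likelihood
    x = np.asarray(prefix, dtype=float).reshape(-1, 1)
    ll_c = m_complete.score(x)
    ll_d = m_dropout.score(x)
    return ('complete' if ll_c > ll_d else 'dropout'), ll_c - ll_d

print(classify([1,1,1,1,1,1]))  # called at each new time step on the stream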
What is your opinion? Is this doable? I have been looking at the Python libraries hmmlearn and pomegranate, but I cannot seem to create a small working example to test with. Some test code of mine, with artificial data, can be found below:
from pomegranate import *
import numpy as np
# generate data of some sample sequences of length 4
# mean and std of each step in sequence
means = [1,2,3,4]
stds = [0.1, 0.1, 0.1, 0.1]
num_data = 100
data = []
for mean, std in zip(means, stds):
    d = np.random.normal(mean, std, num_data)
    data.append(d)
data = np.array(data).T
# create model (based on sample code of pomegranate https://github.com/jmschrei/pomegranate/blob/master/tutorials/Tutorial_3_Hidden_Markov_Models.ipynb)
s1 = State( NormalDistribution( 1, 1 ), name="s1" )
s2 = State( NormalDistribution( 2, 1 ), name="s2" )
model = HiddenMarkovModel()
model.add_states( [s1, s2] )
model.add_transition( model.start, s1, 0.5, pseudocount=4.2 )
model.add_transition( model.start, s2, 0.5, pseudocount=1.3 )
model.add_transition( s1, s2, 0.5, pseudocount=5.2 )
model.add_transition( s2, s1, 0.5, pseudocount=0.9 )
model.bake()
#model.plot()
# fit model
model.fit( data, use_pseudocount=False, algorithm = 'baum-welch', verbose=False )
# get probability of very clean sequence (mean of each step)
p = model.probability([1,2,3,4])
print(p)  # 3.51e-112
I would expect the probability of this very clean sequence to be close to 1, since the values are the means of each step's distribution. How can I make this example better and eventually apply it to my application?
I am not sure what states and transitions my model should comprise. What is a 'good' model? How do you know when you need to add more states to make the model expressive enough for the data? The pomegranate tutorials are nice, but insufficient for me to apply HMMs in this context.
Upvotes: 3
Views: 3028
Reputation: 77827
Yes, the HMM is a viable way to do this, although it's a bit of overkill, since the underlying finite-state machine is a simple linear chain. A simpler "model" can also be built from the mean and variance at each position of the sequence, and you can simply compare the distance of the partial sequence to each class's parameters, re-checking at each desired time point (a sketch of this baseline follows).
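As a rough sketch of that baseline (the helper names, smoothing constant, and toy numbers below are purely illustrative):

import numpy as np

def per_position_stats(sequences, length):
    # stack sequences truncated to `length`; column-wise mean/std per position
    X = np.array([s[:length] for s in sequences if len(s) >= length], dtype=float)
    return X.mean(axis=0), X.std(axis=0) + 1e-3  # smooth to avoid division by zero

def distance(prefix, mean, std):
    # mean squared z-distance of the observed prefix to the class template
    t = len(prefix)
    z = (np.asarray(prefix, dtype=float) - mean[:t]) / std[:t]
    return float(np.mean(z ** 2))

completed = [[1,1,1,1,2,2,2,3,3,3], [1,1,2,2,2,2,3,3,3,3]]
dropped   = [[1,1,1,1,1,1,1,1,2,2], [1,1,1,1,1,2,2,2,2,2]]
mu_c, sd_c = per_position_stats(completed, 10)
mu_d, sd_d = per_position_stats(dropped, 10)

prefix = [1,1,1,1,1,1]  # the stream observed so far
label = 'complete' if distance(prefix, mu_c, sd_c) < distance(prefix, mu_d, sd_d) else 'dropout'
print(label)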
The states are simple enough:
1 ==> 2 ==> 3 ==> ... ==> done
Each state has a loop back to itself; this is the most frequent choice. There is also a transition to "failed" from any state.
Thus, the Markov matrix will be sparse, something like:

         1     2     3     4     done  failed
1        0.8   0.1   0     0     0     0.1
2        0     0.8   0.1   0     0     0.1
3        0     0     0.8   0.1   0     0.1
4        0     0     0     0.8   0.1   0.1
done     0     0     0     0     1.0   0
failed   0     0     0     0     0     1.0
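If you do go the HMM route, one way to encode this sparse left-to-right structure in hmmlearn is to fix the start and transition probabilities by hand and learn only the emission parameters; a sketch, using the illustrative values above:

import numpy as np
from hmmlearn import hmm

transmat = np.array([
    [0.8, 0.1, 0.0, 0.0, 0.0, 0.1],
    [0.0, 0.8, 0.1, 0.0, 0.0, 0.1],
    [0.0, 0.0, 0.8, 0.1, 0.0, 0.1],
    [0.0, 0.0, 0.0, 0.8, 0.1, 0.1],
    [0.0, 0.0, 0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
])

# update only means ('m') and covariances ('c'); startprob/transmat stay fixed
model = hmm.GaussianHMM(n_components=6, params='mc', init_params='mc')
model.startprob_ = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 0.0])
model.transmat_ = transmat
# model.fit(X, lengths) would then estimate only the per-state emissions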
Upvotes: 0