st19297

Reputation: 629

Generating Markov transition matrix in Python

Imagine I have a series of 4 possible Markovian states (A, B, C, D):

X = [A, B, B, C, B, A, D, D, A, B, A, D, ....]

How can I generate a Markov transition matrix using Python? The matrix must be 4 by 4, showing the probability of moving from each state to every state (including itself). I've looked at many examples online, but in all of them the matrix is given rather than calculated from data. I also looked into hmmlearn, but I couldn't find anything on how to make it output the transition matrix. Is there a library that I can use for this purpose?

Here is an R code for the exact thing I am trying to do in Python: https://stats.stackexchange.com/questions/26722/calculate-transition-matrix-markov-in-r

Upvotes: 24

Views: 54065

Answers (5)

sander

Reputation: 1440

In Pandas there is a much easier solution: pd.crosstab. Given your sequence:

import pandas as pd

X = ["A", "B", "B", "C", "B", "A", "D", "D", "A", "B", "A", "D"]

matrix = pd.crosstab(
    pd.Series(X[:-1], name='from'),
    pd.Series(X[1:], name='to'),
    normalize=0
)

Resulting in the following pd.DataFrame:

to      A     B     C    D
from
A     0.0  0.50  0.00  0.5
B     0.5  0.25  0.25  0.0
C     0.0  1.00  0.00  0.0
D     0.5  0.00  0.00  0.5

If you want a np.array instead, use matrix.to_numpy() which results in:

[[0.   0.5  0.   0.5 ]
 [0.5  0.25 0.25 0.  ]
 [0.   1.   0.   0.  ]
 [0.5  0.   0.   0.5 ]]
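One caveat with the crosstab approach: a state that never occurs as a source (or as a target) is left out of the result, so the matrix may not be square. A minimal sketch of one way to force a fixed state list, assuming you know all states up front; the state "E" below is hypothetical and only there to show the effect:

```python
import pandas as pd

X = ["A", "B", "B", "C", "B", "A", "D", "D", "A", "B", "A", "D"]
states = ["A", "B", "C", "D", "E"]  # "E" is a made-up state that never occurs in X

# Raw transition counts (no normalization yet).
counts = pd.crosstab(
    pd.Series(X[:-1], name="from"),
    pd.Series(X[1:], name="to"),
)

# Add the missing rows/columns as zeros so the matrix is square.
counts = counts.reindex(index=states, columns=states, fill_value=0)

# Normalize each row; rows without outgoing transitions stay all-zero
# instead of turning into NaN.
row_totals = counts.sum(axis=1)
matrix = counts.div(row_totals.where(row_totals > 0, 1), axis=0)

print(matrix)
```

Whether an all-zero row is the right convention for an unseen state depends on what you do with the matrix afterwards; it is only one possible choice.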

Upvotes: 5

Andrea Dalseno

Reputation: 190

Thank you @john-coleman, I have updated your code to use numpy:

import numpy as np

def transition_matrix(transitions):
    n = 1+ max(transitions) #number of states

    M = np.zeros((n,n))

    for (i,j) in zip(transitions,transitions[1:]):
        M[i][j] += 1

    #now convert to probabilities
    #(a state with no outgoing transitions gives a 0/0 row of NaN):
    M = M/M.sum(axis=1, keepdims=True)
    return M

t = [1,1,2,6,8,5,5,7,8,8,1,1,4,5,5,0,0,0,1,1,4,4,5,1,3,3,4,5,4,1,1]
m = transition_matrix(t)
for row in m: print(' '.join(f'{x:.2f}' for x in row))

The output is the same:

0.67 0.33 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.50 0.12 0.12 0.25 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00
0.00 0.00 0.00 0.50 0.50 0.00 0.00 0.00 0.00
0.00 0.20 0.00 0.00 0.20 0.60 0.00 0.00 0.00
0.17 0.17 0.00 0.00 0.17 0.33 0.00 0.17 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00
0.00 0.33 0.00 0.00 0.00 0.33 0.00 0.00 0.33
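If some state has no outgoing transitions (for example, it only appears as the last element), the plain division produces a row of NaN. A possible variant, as a sketch only: the function name and the optional `n_states` parameter are my own additions, and leaving such rows at zero is an assumption about what you want.

```python
import numpy as np

def transition_matrix_safe(transitions, n_states=None):
    """Row-normalized transition counts; rows with no data stay all-zero."""
    t = np.asarray(transitions)
    n = int(t.max()) + 1 if n_states is None else n_states

    counts = np.zeros((n, n))
    # Unbuffered add, so repeated (i, j) pairs are all counted.
    np.add.at(counts, (t[:-1], t[1:]), 1)

    row_sums = counts.sum(axis=1, keepdims=True)
    # Divide only where the row sum is positive; elsewhere keep the zeros in `out`.
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)

m = transition_matrix_safe([0, 1, 1, 3])
for row in m:
    print(' '.join(f'{x:.2f}' for x in row))
```

Here states 2 and 3 have no outgoing transitions, so their rows are printed as zeros rather than `nan`.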

Upvotes: 0

The following code provides another solution for a first-order Markov transition matrix. Your data can be a list of integers, a list of strings, or a string. The downside is that this solution will most likely cost more time and memory.

  1. creates a first-order Markov transition matrix (bigrams)
  2. generates 1000 random integers as a dataset to train the Markov transition matrix on
  3. trains the Markov transition matrix

Up to here we have the answer to the question. The rest of the code solves an additional problem: generating new data according to the trained Markov transition matrix.

  1. transforms the probabilities of the Markov transition matrix into cumulative probabilities (as in arithmetic coding)
  2. generates 30 new data points

import pandas as pd

def transition_matrix_order1(data):
    alphabet = []
    for element in data:
        if element not in alphabet:
            alphabet.append(element)
    alphabet.sort()
    
    previous = data[0]
    matrix = pd.DataFrame(0.0, index=alphabet, columns=alphabet)
    
    for i in data[1:]:
        # column = current state, row = next state; each column is normalized below
        matrix.loc[i, previous] += 1.0
        previous = i
    
    total = matrix.sum()
    for element in alphabet:
        matrix[element] = matrix.div(total[element])[element]
    
    return matrix, alphabet



#create data using random integers========
import random
data = [random.randint(1,5) for i in range(1000)] #You can also put list of strings or a string as input data



#create markov transition matrix order 1 (bigram)
markov_matrix, alphabet = transition_matrix_order1(data)



#=the following code uses the probabilities in order to create new data.=



#transform probabilities of markov transition matrix to cumulative
for column in alphabet:
    for pos, index in enumerate(alphabet[1:]):
        markov_matrix.loc[index, column] += markov_matrix.loc[alphabet[pos], column]




#generating 30 data
generated_data = []
feed = random.choice(alphabet)
generated_data.append(feed)
for _ in range(30):
    random_value = random.uniform(0, 1)
    # pick the first state whose cumulative probability reaches the random value
    for state in alphabet:
        if markov_matrix[feed][state] >= random_value:
            generated_data.append(state)
            feed = state
            break



print(generated_data)
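As an aside, the generation step can also be done without building cumulative probabilities by hand, since `random.choices` accepts weights directly. The following is only a sketch of that alternative, not the method used above; the function name `generate_sequence` and the `seed` parameter are my own.

```python
import random
from collections import Counter, defaultdict

def generate_sequence(data, length, seed=None):
    """Sample a new sequence from first-order transition counts of `data`."""
    rng = random.Random(seed)

    # counts[current][next] = number of times `next` followed `current`
    counts = defaultdict(Counter)
    for current, nxt in zip(data, data[1:]):
        counts[current][nxt] += 1

    state = rng.choice(sorted(counts))
    generated = [state]
    for _ in range(length - 1):
        successors = counts.get(state)
        if not successors:
            break  # this state was never followed by anything in the data
        states, weights = zip(*successors.items())
        # raw counts work as weights; random.choices normalizes them internally
        state = rng.choices(states, weights=weights)[0]
        generated.append(state)
    return generated

source = list("ABBCBADDABAD")
sample = generate_sequence(source, 30, seed=0)
print(sample)
```

Every consecutive pair in `sample` is a transition that actually occurs in `source`, because only observed successors get a non-zero weight.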

Upvotes: 2

Iain D

Reputation: 507

If you want to do it all in pandas, here is an approach that works for non-numeric data:

import pandas as pd
transitions = ['A', 'B', 'B', 'C', 'B', 'A', 'D', 'D', 'A', 'B', 'A', 'D']

df = pd.DataFrame(transitions)

# create a new column with data shifted one space
df['shift'] = df[0].shift(-1)

# add a count column (for group by function)
df['count'] = 1

# groupby and then unstack, fill the zeros
trans_mat = df.groupby([0, 'shift']).count().unstack().fillna(0)

# normalise by occurrences and save values to get transition matrix
trans_mat = trans_mat.div(trans_mat.sum(axis=1), axis=0).values

It's slower than the pure Python approach, but it may be worth it for the flexibility and to avoid writing your own function.
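A more compact variant of the same groupby idea, sketched here under the assumption that you would rather keep the state labels than drop to a bare array: `value_counts(normalize=True)` on the grouped column does the counting and the row normalization in one step. The column names "from" and "to" are my own choice.

```python
import pandas as pd

transitions = ['A', 'B', 'B', 'C', 'B', 'A', 'D', 'D', 'A', 'B', 'A', 'D']

# One row per observed transition: current state and the state that follows it.
pairs = pd.DataFrame({'from': transitions[:-1], 'to': transitions[1:]})

# Relative frequency of each 'to' within each 'from' group, pivoted into a matrix.
trans_mat = (
    pairs.groupby('from')['to']
    .value_counts(normalize=True)
    .unstack(fill_value=0)
)

print(trans_mat)
```

Call `trans_mat.to_numpy()` if you need the plain array afterwards.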

Upvotes: 16

John Coleman

Reputation: 51988

This might give you some ideas:

transitions = ['A', 'B', 'B', 'C', 'B', 'A', 'D', 'D', 'A', 'B', 'A', 'D']

def rank(c):
    return ord(c) - ord('A')

T = [rank(c) for c in transitions]

#create matrix of zeros

M = [[0]*4 for _ in range(4)]

for (i,j) in zip(T,T[1:]):
    M[i][j] += 1

#now convert to probabilities:
for row in M:
    n = sum(row)
    if n > 0:
        row[:] = [f/n for f in row]

#print M:

for row in M:
    print(row)

Output:

[0.0, 0.5, 0.0, 0.5]
[0.5, 0.25, 0.25, 0.0]
[0.0, 1.0, 0.0, 0.0]
[0.5, 0.0, 0.0, 0.5]
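The `rank` helper relies on the states being the consecutive letters 'A', 'B', 'C', .... For arbitrary hashable labels, one option is to build the index from the data itself. This is a sketch under that assumption; the function name and the sorted ordering of states are my own choices.

```python
def transition_matrix_labeled(sequence):
    """Return (states, M) where M[i][j] is P(states[i] -> states[j])."""
    states = sorted(set(sequence))               # fixed, reproducible ordering
    index = {s: k for k, s in enumerate(states)}  # label -> row/column number

    n = len(states)
    M = [[0] * n for _ in range(n)]
    for a, b in zip(sequence, sequence[1:]):
        M[index[a]][index[b]] += 1

    # convert counts to probabilities, skipping rows with no transitions
    for row in M:
        total = sum(row)
        if total > 0:
            row[:] = [f / total for f in row]
    return states, M

states, M = transition_matrix_labeled(
    ['A', 'B', 'B', 'C', 'B', 'A', 'D', 'D', 'A', 'B', 'A', 'D']
)
print(states)
for row in M:
    print(row)
```

Returning `states` alongside the matrix keeps track of which row belongs to which label.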

On edit: here is a function which implements the above ideas:

#the following code takes a list such as
#[1,1,2,6,8,5,5,7,8,8,1,1,4,5,5,0,0,0,1,1,4,4,5,1,3,3,4,5,4,1,1]
#with states labeled as successive integers starting with 0
#and returns a transition matrix, M,
#where M[i][j] is the probability of transitioning from i to j

def transition_matrix(transitions):
    n = 1+ max(transitions) #number of states

    M = [[0]*n for _ in range(n)]

    for (i,j) in zip(transitions,transitions[1:]):
        M[i][j] += 1

    #now convert to probabilities:
    for row in M:
        s = sum(row)
        if s > 0:
            row[:] = [f/s for f in row]
    return M

#test:

t = [1,1,2,6,8,5,5,7,8,8,1,1,4,5,5,0,0,0,1,1,4,4,5,1,3,3,4,5,4,1,1]
m = transition_matrix(t)
for row in m: print(' '.join('{0:.2f}'.format(x) for x in row))

Output:

0.67 0.33 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.50 0.12 0.12 0.25 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00
0.00 0.00 0.00 0.50 0.50 0.00 0.00 0.00 0.00
0.00 0.20 0.00 0.00 0.20 0.60 0.00 0.00 0.00
0.17 0.17 0.00 0.00 0.17 0.33 0.00 0.17 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00
0.00 0.33 0.00 0.00 0.00 0.33 0.00 0.00 0.33

Upvotes: 34
