Estimating transition probabilities (pandas)

Question

I have data on 3 event types and I want to estimate the transition probabilities Pij(1): the probability that event i is immediately followed by event j, given that event i happened (so I need conditional probabilities). I also want to know Pij(2) and Pij(3), the conditional probabilities that the second (third) event after event i is event j.

Have a look at some mock-up data:

import pandas as pd
import numpy as np
np.random.seed(5)
strings=list('ABC')
events=[strings[i] for i in np.random.randint(0,3,20)]
groups=[1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2]
index=pd.date_range('2/2/2012',periods=20,freq='min')  # 'min' = minutely; the 'T' alias is deprecated in recent pandas
dfm=pd.DataFrame(data={'event':events,'group':groups},index=index)
dfm.head()

                    event  group
2012-02-02 00:00:00     C      1
2012-02-02 00:01:00     B      1
2012-02-02 00:02:00     C      1
2012-02-02 00:03:00     C      1
2012-02-02 00:04:00     A      1

So far I have followed a very inelegant and naïve strategy and used shift to see which events occurred in the next periods:

#Create new columns containing the shifted values
for i in range(1,4):
    dfm['event_t%i'%i]=dfm.event.groupby(dfm.group).shift(-i)
#Combine the columns with current and shifted values into one 
for i in range(1,4):
    dfm['NEWevent_t%i'%i]=dfm['event']+' '+dfm['event_t%i'%i]
    dfm = dfm.drop('event_t%i'%i, axis=1)

#Count the number of times each combination occurs
A=dfm['NEWevent_t1'].groupby(dfm.group).value_counts()
B=dfm['NEWevent_t2'].groupby(dfm.group).value_counts()
C=dfm['NEWevent_t3'].groupby(dfm.group).value_counts()

merged=pd.concat([A, B, C], axis=1)

This indeed gives the number of times a specific event combination (e.g. "A A", "A B", ...) occurs for each group. Proceeding with this, I can do a groupby using both the group variable and the first letter of the two-letter pair as grouping variables. This brute force solution could look like:

merged=merged.reset_index()
merged['first']=merged['level_1'].apply(lambda x: x[0])
merged.columns=['group','i j','t1','t2','t3','first']
#Total count per (group, first event); a list of columns needs double brackets
sums=merged.groupby(['group','first'])[['t1','t2','t3']].sum()
merged=pd.merge(merged,sums,left_on=['group','first'],right_index=True)
merged['Pij(1)']=merged.t1_x/merged.t1_y
merged['Pij(2)']=merged.t2_x/merged.t2_y
merged['Pij(3)']=merged.t3_x/merged.t3_y
merged[['group','i j','Pij(1)','Pij(2)','Pij(3)']].head()

   group  i j  Pij(1)    Pij(2)    Pij(3)
0      1  A A    0.25  0.666667  0.666667
1      1  A B    0.25       NaN       NaN
2      1  A C    0.50  0.333333  0.333333
3      1  B A    0.50  0.500000  0.500000
4      1  B C    0.50  0.500000  0.500000

I believe there must be a much easier way to achieve this. Any suggestions on how to make it more efficient?

Note: my actual dataset contains 5 million rows, 10 event types and 100 groups.

thefourtheye · Accepted Answer

A good way to present transition probabilities is a transition matrix, where T(i,j) is the probability that event i is followed by event j. Let's start with your data:

import pandas as pd
import numpy as np

np.random.seed(5)
strings=list('ABC')
events=[strings[i] for i in np.random.randint(0,3,20)]
groups=[1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2]
index=pd.date_range('2/2/2012',periods=20,freq='min')  # 'min' = minutely; the 'T' alias is deprecated in recent pandas
dfm=pd.DataFrame(data={'event':events,'group':groups},index=index)
for i in range(1,4):
    dfm['event_t%i'%i]=dfm.event.groupby(dfm.group).shift(-i)

I think your shift approach is okay, but that's just me. Anyway, from here you restrict to 'group' == 1 and populate the transition matrix with counts. In the end, you divide each row by its row sum to get the transition probabilities, so that every row with at least one observed transition sums to 1.

trans = pd.DataFrame(columns=strings, index=strings)
g_dfm = dfm[dfm['group']==1]

for s1 in strings:
    for s2 in strings:
        events = g_dfm[(g_dfm['event']==s1) & (g_dfm['event_t1']==s2)]
        trans.loc[s1, s2] = len(events)  # .ix was removed from pandas; .loc is the label-based replacement

trans = trans.astype(float).div(trans.sum(axis=1), axis=0)
trans = trans.fillna(0)
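
As a more compact alternative to the double loop above, `pd.crosstab` with `normalize='index'` can build the same kind of row-normalised table in one call. This is only a sketch on a small hand-made sequence: the `toy` frame and its values are made up for illustration and are not the question's data. The `reindex` step is there so that events never observed as a source or a target would still show up as rows or columns of zeros.

```python
import pandas as pd

states = list('ABC')

# Hypothetical toy sequence for a single group (not the question's data)
toy = pd.DataFrame({'event': list('ABACAB')})
toy['event_t1'] = toy['event'].shift(-1)

# Drop the last row, which has no following event
pairs = toy.dropna(subset=['event_t1'])

# Count (event, next event) pairs and divide each row by its row total
trans_ct = (
    pd.crosstab(pairs['event'], pairs['event_t1'], normalize='index')
      .reindex(index=states, columns=states, fill_value=0.0)
)
print(trans_ct)
```

Here A is followed by B twice and by C once, so the A row comes out as 0, 2/3, 1/3.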

From there, you can make a heatmap:

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(3,3))
ax.pcolormesh(trans.values, cmap=plt.get_cmap('Blues'), vmin=0, vmax=1)
ax.invert_yaxis()
ax.set_yticks(np.arange(0, len(trans.index))+0.5)
ax.set_xticks(np.arange(0, len(trans.columns))+0.5)
ax.set_yticklabels(trans.index, fontsize=16, color='k')
ax.set_xticklabels(trans.columns, fontsize=16, color='k')
ax.tick_params(direction='out', pad=10)
ax.set_frame_on(True)

# Hide the tick marks (the tick1On/tick2On attributes were removed from matplotlib)
for tk in ax.xaxis.get_major_ticks() + ax.yaxis.get_major_ticks():
    tk.tick1line.set_visible(False)
    tk.tick2line.set_visible(False)

plt.show()

Rinse and repeat for all of your groups and second and third transitions.
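
If repeating this by hand for every group and lag gets tedious, the "rinse and repeat" step could be sketched as one function that handles all groups at once for a given lag. The helper name `transition_probs`, its parameters, and the toy data below are assumptions made for illustration; it has not been benchmarked on a dataset of the size mentioned in the question. The shift is done within each group, so no transition crosses a group boundary.

```python
import pandas as pd

def transition_probs(df, lag, states, event_col='event', group_col='group'):
    """Hypothetical helper: P(event j at t+lag | event i at t), per group."""
    # Shift within each group so transitions never cross group boundaries
    nxt = df.groupby(group_col)[event_col].shift(-lag).rename('next_event')
    valid = nxt.notna()
    # Rows: (group, current event); columns: event `lag` steps later
    counts = pd.crosstab(
        [df.loc[valid, group_col], df.loc[valid, event_col]],
        nxt[valid],
    )
    counts = counts.reindex(columns=states, fill_value=0)
    # Divide each row by its total to turn counts into conditional probabilities
    return counts.div(counts.sum(axis=1), axis=0)

# Toy data, made up for illustration (two groups)
df = pd.DataFrame({
    'event': list('ABACAB') + list('CCBA'),
    'group': [1] * 6 + [2] * 4,
})
states = list('ABC')

# One table per lag: probs[1] ~ Pij(1), probs[2] ~ Pij(2), probs[3] ~ Pij(3)
probs = {k: transition_probs(df, k, states) for k in (1, 2, 3)}
print(probs[1])
```

Only (group, event) pairs that actually have a following event at the given lag appear as rows, so each row sums to 1 and no division by zero occurs.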
