Transform sequences of values in a column to rows for a timeseries of events in Pandas

Question

I am working with a timeseries in which certain events occur in a given order, A->B->C->D. I want to create a new DataFrame whose columns are the times of these events. Namely, from the DataFrame old_df:

    ev_type       ev_time
1     W      2012-05-27 02:06:01
2     A      2012-05-28 02:06:01
3     B      2012-05-28 03:06:01
4     C      2012-05-28 04:06:01
5     D      2012-05-28 02:06:03
6     K      2012-05-28 02:06:01
...   ...    ...................
60000 D      2016-01-01 01:01:01

I'd like to get df:

              A_time               B_time               C_time                D_time
1       2012-05-28 02:06:01  2012-05-28 03:06:01  2012-05-28 04:06:01  2012-05-28 04:06:01
...             ....             ....               ....                    ....
5000    2015-05-28 02:06:01  2015-06-28 02:06:01 2015-07-28 02:06:01 2015-08-28 02:06:01

What I did is

A_events = old_df.ev_type == 'A'
df = old_df[A_events].ev_time.to_frame()
df.rename(columns={"ev_time":"A_time"},inplace=True)
df.join(old_df[A_events.shift(1).fillna(False)].ev_time.shift(-1),axis=1)

But this last line doesn't work: shift moves the values but doesn't change the index, so the B times don't line up with the A rows. The best I could get is

     A_time               B_time 
2  2012-05-28 02:06:01    NaT
3   NaT                  2012-05-28 03:06:01

How can I align the two Series? Or are there better strategies to extract a sequence of events or a pattern from a pandas dataframe?

EDIT

Following the code suggested by @Stefan below, a generator for my data is

df = pd.DataFrame(data={'ev_type': np.random.choice(list("ABCDWK"), size=100, replace=True), 'ev_time': pd.date_range(start=pd.Timestamp(2016, 1, 1), freq='M', periods=100)})

DdD · Accepted Answer

For anyone who visits this question with a similar issue, here is how I solved it. I am not sure it is the most pythonic or memory-efficient way to look for event sequences...

To generate the data I used the code suggested by Stefan:

import numpy as np
import pandas as pd
size_of_df = 10000
df_old = pd.DataFrame(data={'ev_type': np.random.choice(list("ABCDWK"), size=size_of_df, replace=True), 'ev_time': pd.date_range(start=pd.Timestamp(2016, 1, 1), freq='h', periods=size_of_df)})

The sequence doesn't appear often, so the df has to be long enough (or you have to be lucky).
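
As a rough sanity check on how rare the pattern is: assuming the six letters are drawn independently and uniformly, as in the generator above, a given row starts an A, B, C, D run with probability (1/6)^4. The snippet below only does that arithmetic; the variable names are mine, for illustration.

```python
# Expected number of "ABCD" runs in a random frame, assuming each row's
# ev_type is drawn independently and uniformly from 6 letters.
n_rows = 10000
n_letters = 6
seq_len = 4

p_match = (1 / n_letters) ** seq_len   # chance that a given row starts the run
n_starts = n_rows - seq_len + 1        # rows that still have 3 rows after them
expected_matches = n_starts * p_match

print(f"p = {p_match:.6f}, expected matches = {expected_matches:.1f}")
```

So with 10000 rows you should expect only a handful of matches, and with 100 rows most likely none.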

df_old.head(5)

              ev_time ev_type
0 2016-01-01 00:00:00       D
1 2016-01-01 01:00:00       D
2 2016-01-01 02:00:00       A
3 2016-01-01 03:00:00       C
4 2016-01-01 04:00:00       W

Then I concatenated shifted copies of the dataframe side by side, so that each row holds an event together with the events that follow it:

sequence = "ABCD"
evnt = pd.concat([df_old.shift(-ix) for ix,let in enumerate(list(sequence))],axis=1,keys=list(sequence))
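
To see what the shift-and-concat step produces, here is a minimal sketch on a tiny hand-made frame (the data and the names small and wide are illustrative, not from the question). Each row of the result holds the event at that row plus the next three events, under the top-level keys A, B, C, D.

```python
import pandas as pd

# Tiny hand-made frame (illustrative data, not from the question).
small = pd.DataFrame({
    "ev_type": list("WABCDK"),
    "ev_time": pd.date_range("2016-01-01", periods=6, freq="h"),
})

sequence = "ABCD"
# shift(-k) moves row i+k up to row i, so after concatenating along
# axis=1 every row sees itself and the following three rows.
wide = pd.concat(
    [small.shift(-k) for k in range(len(sequence))],
    axis=1,
    keys=list(sequence),
)

# Row 1 is the 'A' event, so its four ev_type cells spell the sequence.
print(wide.loc[1].xs("ev_type", level=1).tolist())
```

The last rows have nothing left to shift in, so their trailing columns are filled with missing values.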

and looked for the sequence

tmp_evt = evnt.xs('ev_type',level=1,axis=1)
tmp_seq = tmp_evt.apply(lambda x: x.str.cat(),axis=1)
tmp_seq.head()

0    DDAC
1    DACW
2    ACWK
3    CWKD
4    WKDA
dtype: object

bool_sequence = tmp_seq == sequence
col_name = {let: let + "_time" for let in sequence}
evnt[bool_sequence].xs('ev_time', level=1, axis=1).rename(columns=col_name).head()


                  A_time              B_time              C_time  \
1648 2016-03-09 16:00:00 2016-03-09 17:00:00 2016-03-09 18:00:00   
2913 2016-05-01 09:00:00 2016-05-01 10:00:00 2016-05-01 11:00:00   
3803 2016-06-07 11:00:00 2016-06-07 12:00:00 2016-06-07 13:00:00   
3879 2016-06-10 15:00:00 2016-06-10 16:00:00 2016-06-10 17:00:00   
4730 2016-07-16 02:00:00 2016-07-16 03:00:00 2016-07-16 04:00:00   

                  D_time  
1648 2016-03-09 19:00:00  
2913 2016-05-01 12:00:00  
3803 2016-06-07 14:00:00  
3879 2016-06-10 18:00:00  
4730 2016-07-16 05:00:00
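
For reuse, the same idea can be wrapped in a function. This is only a sketch of a variant, not the code above: it compares shifted columns directly instead of concatenating strings, and the name find_sequence and its parameters are hypothetical. Like the approach above, it only finds runs on consecutive rows.

```python
import pandas as pd

def find_sequence(df, sequence, type_col="ev_type", time_col="ev_time"):
    """Return one row per consecutive run of `sequence`, one time column per event."""
    # Row i matches when row i+k has type sequence[k] for every k.
    mask = pd.Series(True, index=df.index)
    for k, letter in enumerate(sequence):
        mask = mask & (df[type_col].shift(-k) == letter)
    # Time of each event in the run, indexed by the row where the run starts.
    times = {
        f"{letter}_time": df[time_col].shift(-k)[mask]
        for k, letter in enumerate(sequence)
    }
    return pd.DataFrame(times)

# Illustrative data with two runs, starting at rows 1 and 6.
events = pd.DataFrame({
    "ev_type": list("WABCDKABCD"),
    "ev_time": pd.date_range("2016-01-01", periods=10, freq="h"),
})
result = find_sequence(events, "ABCD")
print(result)
```

The index of the result is the row where each run starts, the same as in the output above.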
