DdD
DdD

Reputation: 612

Transform sequences of values in a column to rows for a timeseries of events in Pandas

I am working with a timeseries with certain events that occur with a given order: A->B->C->D and I want to create a new DataFrame having as columns the time of these events, namely from the DataFrame old_df:

    ev_type       ev_time
1     W      2012-05-27 02:06:01
2     A      2012-05-28 02:06:01
3     B      2012-05-28 03:06:01
4     C      2012-05-28 04:06:01
5     D      2012-05-28 02:06:03
6     K      2012-05-28 02:06:01
...   ...    ...................
60000 D      2016-01-01 01:01:01

I'd like to get df:

              A_time               B_time               C_time                D_time
1       2012-05-28 02:06:01  2012-05-28 03:06:01  2012-05-28 04:06:01  2012-05-28 04:06:01
...             ....             ....               ....                    ....
5000    2015-05-28 02:06:01  2015-06-28 02:06:01 2015-07-28 02:06:01 2015-08-28 02:06:01

What I did is

A_events = old_df.evtype == 'A'
df = old_df[A_events ].ev_time.to_frame()
df.rename(columns={"ev_time":"A_time"},inplace=True)
df.join(old_df[A_events.shift(1).fillna(False)].ev_time.shift(-1),axis=1)

But this last line doesn't work because it doesn't change the index. The best I could get is

     A_time               B_time 
2  2012-05-28 02:06:01    NaT
3   NaT                  2012-05-28 03:06:01

How can I align the two Series? Or are there better strategies to extract a sequence of event or a pattern from a pandas dataframe?

EDIT

Following the code suggested by @Stefan below, a generator for my data is

df = pd.DataFrame(data={'ev_type': np.random.choice(list("ABCDWK"), size=100,replace=True), 'ev_time': pd.date_range(start=pd.datetime(2016,1,1),freq='M', periods=100)})

Upvotes: 1

Views: 390

Answers (2)

Stefan
Stefan

Reputation: 42885

After clarifying the question, you could:

Sample data:

size = 10000
df = pd.DataFrame(data={'ev_type': np.random.choice(list("ABCDWK"), size=size, replace=True),
                        'ev_time': pd.date_range(start=pd.datetime(2016, 1, 1), freq='Min', periods=size)})


seq = 'ABCD'
df['sequence'] = list(range(len(seq))) * int((size/len(seq)))
df = df.set_index('sequence', append=True).unstack().fillna(method='ffill').fillna(method='bfill').drop_duplicates()
df['ev_type'] = df.loc[:, 'ev_type'].sum(axis=1)
df.loc[df[('ev_type', 0)] == 'ABCD'].loc[:, 'ev_time'].rename(columns={c: seq[c] for c in df.loc[:, 'ev_time'].columns})

to get:

sequence                   A                   B                   C  \
1197     2016-01-01 19:56:00 2016-01-01 19:57:00 2016-01-01 19:54:00   
1198     2016-01-01 19:56:00 2016-01-01 19:57:00 2016-01-01 19:58:00   
2498     2016-01-02 17:36:00 2016-01-02 17:37:00 2016-01-02 17:38:00   
4269     2016-01-03 23:08:00 2016-01-03 23:09:00 2016-01-03 23:06:00   
4946     2016-01-04 10:24:00 2016-01-04 10:25:00 2016-01-04 10:26:00   
4947     2016-01-04 10:24:00 2016-01-04 10:25:00 2016-01-04 10:26:00   
5476     2016-01-04 19:16:00 2016-01-04 19:13:00 2016-01-04 19:14:00   
6808     2016-01-05 17:28:00 2016-01-05 17:25:00 2016-01-05 17:26:00   
6809     2016-01-05 17:28:00 2016-01-05 17:29:00 2016-01-05 17:26:00   
6810     2016-01-05 17:28:00 2016-01-05 17:29:00 2016-01-05 17:30:00   
6811     2016-01-05 17:28:00 2016-01-05 17:29:00 2016-01-05 17:30:00   

sequence                   D  
1197     2016-01-01 19:55:00  
1198     2016-01-01 19:55:00  
2498     2016-01-02 17:35:00  
4269     2016-01-03 23:07:00  
4946     2016-01-04 10:23:00  
4947     2016-01-04 10:27:00  
5476     2016-01-04 19:15:00  
6808     2016-01-05 17:27:00  
6809     2016-01-05 17:27:00  
6810     2016-01-05 17:27:00  
6811     2016-01-05 17:31:00

Upvotes: 1

DdD
DdD

Reputation: 612

For whoever visits this question looking for a similar issue, here I report how I did solve it. I am not sure it is the most pythonic/memory efficient way to look for event sequences...

To generate the data I used the code suggested by Stefan

size_of_df = 10000
df_old = pd.DataFrame(data={'ev_type': np.random.choice(list("ABCDWK"), size=size_of_df,replace=True), 'ev_time': pd.date_range(start=pd.datetime(2016,1,1),freq='h', periods=size_of_df)})

The sequence doesn't appear often, so the length of the df has to be big enough (or you have to be luck)

df_old.head(5)

              ev_time ev_type
0 2016-01-01 00:00:00       D
1 2016-01-01 01:00:00       D
2 2016-01-01 02:00:00       A
3 2016-01-01 03:00:00       C
4 2016-01-01 04:00:00       W

Then, I shifted the dataframe and glued it, to get all events in a row

sequence = "ABCD"
evnt = pd.concat([df_old.shift(-ix) for ix,let in enumerate(list(sequence))],axis=1,keys=list(sequence))

and looked for the sequence

tmp_evt = evnt.xs('ev_type',level=1,axis=1)
tmp_seq = tmp_evt.apply(lambda x: x.str.cat(),axis=1)
tmp_seq.head()

0    DDAC
1    DACW
2    ACWK
3    CWKD
4    WKDA
dtype: object

bool_sequence = tmp_seq == 'ABCD'
col_name=dict(zip(list(sequence),[ let +   "_time" for let in list(sequence)]))
evnt[bool_sequence].xs('ev_time',level=1,axis=1).rename(columns=col_name).head()


                  A_time              B_time              C_time  \
1648 2016-03-09 16:00:00 2016-03-09 17:00:00 2016-03-09 18:00:00   
2913 2016-05-01 09:00:00 2016-05-01 10:00:00 2016-05-01 11:00:00   
3803 2016-06-07 11:00:00 2016-06-07 12:00:00 2016-06-07 13:00:00   
3879 2016-06-10 15:00:00 2016-06-10 16:00:00 2016-06-10 17:00:00   
4730 2016-07-16 02:00:00 2016-07-16 03:00:00 2016-07-16 04:00:00   

                  D_time  
1648 2016-03-09 19:00:00  
2913 2016-05-01 12:00:00  
3803 2016-06-07 14:00:00  
3879 2016-06-10 18:00:00  
4730 2016-07-16 05:00:00  

Upvotes: 1

Related Questions