Reputation: 612
I am working with a timeseries with certain events that occur with a given order: A->B->C->D
and I want to create a new DataFrame
having as columns the time of these events, namely from the DataFrame
old_df
:
ev_type ev_time
1 W 2012-05-27 02:06:01
2 A 2012-05-28 02:06:01
3 B 2012-05-28 03:06:01
4 C 2012-05-28 04:06:01
5 D 2012-05-28 02:06:03
6 K 2012-05-28 02:06:01
... ... ...................
60000 D 2016-01-01 01:01:01
I'd like to get df
:
A_time B_time C_time D_time
1 2012-05-28 02:06:01 2012-05-28 03:06:01 2012-05-28 04:06:01 2012-05-28 04:06:01
... .... .... .... ....
5000 2015-05-28 02:06:01 2015-06-28 02:06:01 2015-07-28 02:06:01 2015-08-28 02:06:01
What I did is
A_events = old_df.evtype == 'A'
df = old_df[A_events ].ev_time.to_frame()
df.rename(columns={"ev_time":"A_time"},inplace=True)
df.join(old_df[A_events.shift(1).fillna(False)].ev_time.shift(-1),axis=1)
But this last line doesn't work because it doesn't change the index. The best I could get is
A_time B_time
2 2012-05-28 02:06:01 NaT
3 NaT 2012-05-28 03:06:01
How can I align the two Series? Or are there better strategies to extract a sequence of event or a pattern from a pandas dataframe?
Following the code suggested by @Stefan below, a generator for my data is
df = pd.DataFrame(data={'ev_type': np.random.choice(list("ABCDWK"), size=100,replace=True), 'ev_time': pd.date_range(start=pd.datetime(2016,1,1),freq='M', periods=100)})
Upvotes: 1
Views: 390
Reputation: 42885
After clarifying the question, you could:
Sample data:
size = 10000
df = pd.DataFrame(data={'ev_type': np.random.choice(list("ABCDWK"), size=size, replace=True),
'ev_time': pd.date_range(start=pd.datetime(2016, 1, 1), freq='Min', periods=size)})
seq = 'ABCD'
df['sequence'] = list(range(len(seq))) * int((size/len(seq)))
df = df.set_index('sequence', append=True).unstack().fillna(method='ffill').fillna(method='bfill').drop_duplicates()
df['ev_type'] = df.loc[:, 'ev_type'].sum(axis=1)
df.loc[df[('ev_type', 0)] == 'ABCD'].loc[:, 'ev_time'].rename(columns={c: seq[c] for c in df.loc[:, 'ev_time'].columns})
to get:
sequence A B C \
1197 2016-01-01 19:56:00 2016-01-01 19:57:00 2016-01-01 19:54:00
1198 2016-01-01 19:56:00 2016-01-01 19:57:00 2016-01-01 19:58:00
2498 2016-01-02 17:36:00 2016-01-02 17:37:00 2016-01-02 17:38:00
4269 2016-01-03 23:08:00 2016-01-03 23:09:00 2016-01-03 23:06:00
4946 2016-01-04 10:24:00 2016-01-04 10:25:00 2016-01-04 10:26:00
4947 2016-01-04 10:24:00 2016-01-04 10:25:00 2016-01-04 10:26:00
5476 2016-01-04 19:16:00 2016-01-04 19:13:00 2016-01-04 19:14:00
6808 2016-01-05 17:28:00 2016-01-05 17:25:00 2016-01-05 17:26:00
6809 2016-01-05 17:28:00 2016-01-05 17:29:00 2016-01-05 17:26:00
6810 2016-01-05 17:28:00 2016-01-05 17:29:00 2016-01-05 17:30:00
6811 2016-01-05 17:28:00 2016-01-05 17:29:00 2016-01-05 17:30:00
sequence D
1197 2016-01-01 19:55:00
1198 2016-01-01 19:55:00
2498 2016-01-02 17:35:00
4269 2016-01-03 23:07:00
4946 2016-01-04 10:23:00
4947 2016-01-04 10:27:00
5476 2016-01-04 19:15:00
6808 2016-01-05 17:27:00
6809 2016-01-05 17:27:00
6810 2016-01-05 17:27:00
6811 2016-01-05 17:31:00
Upvotes: 1
Reputation: 612
For whoever visits this question looking for a similar issue, here I report how I did solve it. I am not sure it is the most pythonic/memory efficient way to look for event sequences...
To generate the data I used the code suggested by Stefan
size_of_df = 10000
df_old = pd.DataFrame(data={'ev_type': np.random.choice(list("ABCDWK"), size=size_of_df,replace=True), 'ev_time': pd.date_range(start=pd.datetime(2016,1,1),freq='h', periods=size_of_df)})
The sequence doesn't appear often, so the length of the df has to be big enough (or you have to be luck)
df_old.head(5)
ev_time ev_type
0 2016-01-01 00:00:00 D
1 2016-01-01 01:00:00 D
2 2016-01-01 02:00:00 A
3 2016-01-01 03:00:00 C
4 2016-01-01 04:00:00 W
Then, I shifted the dataframe and glued it, to get all events in a row
sequence = "ABCD"
evnt = pd.concat([df_old.shift(-ix) for ix,let in enumerate(list(sequence))],axis=1,keys=list(sequence))
and looked for the sequence
tmp_evt = evnt.xs('ev_type',level=1,axis=1)
tmp_seq = tmp_evt.apply(lambda x: x.str.cat(),axis=1)
tmp_seq.head()
0 DDAC
1 DACW
2 ACWK
3 CWKD
4 WKDA
dtype: object
bool_sequence = tmp_seq == 'ABCD'
col_name=dict(zip(list(sequence),[ let + "_time" for let in list(sequence)]))
evnt[bool_sequence].xs('ev_time',level=1,axis=1).rename(columns=col_name).head()
A_time B_time C_time \
1648 2016-03-09 16:00:00 2016-03-09 17:00:00 2016-03-09 18:00:00
2913 2016-05-01 09:00:00 2016-05-01 10:00:00 2016-05-01 11:00:00
3803 2016-06-07 11:00:00 2016-06-07 12:00:00 2016-06-07 13:00:00
3879 2016-06-10 15:00:00 2016-06-10 16:00:00 2016-06-10 17:00:00
4730 2016-07-16 02:00:00 2016-07-16 03:00:00 2016-07-16 04:00:00
D_time
1648 2016-03-09 19:00:00
2913 2016-05-01 12:00:00
3803 2016-06-07 14:00:00
3879 2016-06-10 18:00:00
4730 2016-07-16 05:00:00
Upvotes: 1