Paweł Rumian
Paweł Rumian

Reputation: 3816

Detection of concurrent event appearances - pandas and DataFrame, but possibly pure-python related

I am trying to find the fastest and most elegant solution to the situation described below.

I have a large DataFrame which contains the record of some machine states. For this example let's assume we have only two machines: one and two (in reality I've got much more, but this shouldn't matter). Every machine can be in one of two states: up and down.

Here is how an example record can be created:

import pandas as pd

df = pd.DataFrame({'time' : range(8),
'machine' : ["one","one","two","two","one","two","two","one"],
'status' : ["up","down","up","down","up","up","down","down"]})

and how it looks like:

time    machine status
0       one     up
1       one     down
2       two     up
3       two     down
4       one     up
5       two     up
6       two     down
7       one     down

The records are strictly ordered (in reality by timestamps).

Now, what I want to achieve is to find situations, where both machines were in a high state (up), and preferably return a subset of the DataFrame, showing such situations. In the example above, a subset between time 5 and 6 is interesting for me, as both one and two machines were in up state at that time.

The brute force solution would be probably to go through the data frame and store the state of all machines in a list, checking if we have two of them in an up state during every iteration, but maybe there is a more elegant solution?

All suggestions are more than welcome.

Upvotes: 0

Views: 622

Answers (1)

DSM
DSM

Reputation: 353059

To get started, how about something like

df2 = df.pivot(index="time", columns="machine", values="status")
df2 = df2.fillna(method='ffill')
both_up = df2[(df2 == "up").all(axis=1)]

which produces

>>> both_up
machine one two
time           
5        up  up

given your example.


First, build the initial DataFrame:

>>> df = pd.DataFrame({'time' : range(8),
'machine' : ["one","one","two","two","one","two","two","one"],
'status' : ["up","down","up","down","up","up","down","down"]})
>>> df
  machine status  time
0     one     up     0
1     one   down     1
2     two     up     2
3     two   down     3
4     one     up     4
5     two     up     5
6     two   down     6
7     one   down     7

Then pivot:

>>> df2 = df.pivot(index="time", columns="machine", values="status")
>>> df2
machine   one   two
time               
0          up   NaN
1        down   NaN
2         NaN    up
3         NaN  down
4          up   NaN
5         NaN    up
6         NaN  down
7        down   NaN

We don't know anything about the state of machine two before time 2 -- unless we assume that only transitions are recorded and not states, so it must have been down before -- but the other missing states we can guess assuming all the relevant changes are caught. IOW, we can forward-fill:

>>> df2 = df2.fillna(method='ffill')
>>> df2
machine   one   two
time               
0          up   NaN
1        down   NaN
2        down    up
3        down  down
4          up  down
5          up    up
6          up  down
7        down  down

And then we can look for the places where they're both up:

>>> (df2 == "up").all(axis=1)
time
0       False
1       False
2       False
3       False
4       False
5        True
6       False
7       False
dtype: bool
>>> both_up = df2[(df2 == "up").all(axis=1)]
>>> both_up
machine one two
time           
5        up  up

With just a little more effort you could get [5,6) or some duration measure out of it but hopefully the above will help you get started.

Upvotes: 3

Related Questions