Reputation: 1445
Given a DataFrame that looks like this
Index Time Val
1 2017-06-29 17:48 0
2 2017-06-29 17:49 0
3 2017-06-29 17:50 1
4 2017-06-29 17:51 2
5 2017-06-29 17:52 3
6 2017-06-29 17:53 0
7 2017-06-29 17:54 0
8 2017-06-29 17:55 0
9 2017-06-29 17:56 0
10 2017-06-29 17:57 0
How can I split it into two lists of dataframes, where one list contains the blocks where Val == 0 and the other contains the blocks where Val > 0 (Val < 0 does not occur)? There is one complication: after a run of Val > 0, the next time Val == 0 a new dataframe should start, and vice versa.
So based on the data frame above, the first list would contain two dataframes: one containing indexes 1-2 and one containing the indexes 6-9. The second list should contain one dataframe of indexes 3-5.
I know I can strip out the Val > 0 rows with df[df.Val == 0], which would give the dataframe below, but because of the break between indexes 2 and 6, I need them in different dataframes.
Index Time Val
1 2017-06-29 17:48 0
2 2017-06-29 17:49 0
6 2017-06-29 17:53 0
7 2017-06-29 17:54 0
8 2017-06-29 17:55 0
9 2017-06-29 17:56 0
10 2017-06-29 17:57 0
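As a minimal, self-contained illustration of that filtering step (the frame is rebuilt from the table above; a default 0-based index is assumed rather than the 1-based Index column shown):

```python
import pandas as pd

# Reconstruct the example frame; times and values taken from the table above.
df = pd.DataFrame({
    'Time': pd.date_range('2017-06-29 17:48', periods=10, freq='min'),
    'Val': [0, 0, 1, 2, 3, 0, 0, 0, 0, 0]})

zeros = df[df.Val == 0]
# The jump in the index (1 -> 5 here) is the block boundary that still
# needs to be detected before the rows can go into separate dataframes.
print(zeros.index.tolist())  # -> [0, 1, 5, 6, 7, 8, 9]
```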
N.B. This needs to scale to a large data frame (millions of rows) so speed is desirable. Going through each row and looking for a break in the indexes (or timestamps) is not preferable.
Upvotes: 1
Views: 1142
Reputation: 13274
I cannot guarantee that the following will run fast, but it should get you pretty far. The idea is to use what is known as run-length encoding on the Val column to generate a new column by which to group your dataframe. The following should serve as a decent start:
import pandas as pd
from pandas import Timestamp
from itertools import groupby
from functools import reduce
d = {'Time': [Timestamp('2017-06-29 17:48:00'),
              Timestamp('2017-06-29 17:49:00'),
              Timestamp('2017-06-29 17:50:00'),
              Timestamp('2017-06-29 17:51:00'),
              Timestamp('2017-06-29 17:52:00'),
              Timestamp('2017-06-29 17:53:00'),
              Timestamp('2017-06-29 17:54:00'),
              Timestamp('2017-06-29 17:55:00'),
              Timestamp('2017-06-29 17:56:00'),
              Timestamp('2017-06-29 17:57:00')],
     'Val': [0, 0, 1, 2, 3, 0, 0, 0, 0, 0]}
df = pd.DataFrame(d)
# Run-length encode Val: each maximal run with the same (Val > 0) truth value
# gets a single integer label, repeated once per row of the run.
df['grouper'] = reduce(list.__add__, ([x]*len(list(y[1])) for x, y in enumerate(groupby(df.Val, key=lambda x: x > 0))))
bins = [[], []]
for _, frame in df.groupby('grouper'):
    if (frame.Val == 0).all():
        bins[0].append(frame.iloc[:, :-1])
    else:
        bins[1].append(frame.iloc[:, :-1])
print(bins)
This should yield the following output:
# [[ Time Val
# 0 2017-06-29 17:48:00 0
# 1 2017-06-29 17:49:00 0, Time Val
# 5 2017-06-29 17:53:00 0
# 6 2017-06-29 17:54:00 0
# 7 2017-06-29 17:55:00 0
# 8 2017-06-29 17:56:00 0
# 9 2017-06-29 17:57:00 0], [ Time Val
# 2 2017-06-29 17:50:00 1
# 3 2017-06-29 17:51:00 2
# 4 2017-06-29 17:52:00 3]]
The idea here is that you are applying run-length encoding to the Val column, which basically means that you compute the lengths of runs of equal values. The output of this process is saved in a new column called grouper, and it is this column that is used to group your initial dataframe. Once the grouping is done, you place the separate dataframes into your bins list with a for-loop.
While I cannot vouch for the speed, I believe this idea gets you your desired output very easily. You could try to implement the run-length encoding idea with numpy for a bit of extra speed.
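One vectorized way to get the same run labels without itertools, sketched here with plain pandas operations (shift/cumsum; names like is_zero and group_id are my own, not from the answer above):

```python
import pandas as pd

d = {'Time': pd.date_range('2017-06-29 17:48', periods=10, freq='min'),
     'Val': [0, 0, 1, 2, 3, 0, 0, 0, 0, 0]}
df = pd.DataFrame(d)

# A run starts wherever the zero/nonzero mask changes value; a cumulative
# sum over those change points yields one integer id per run.
is_zero = df['Val'].eq(0)
group_id = is_zero.ne(is_zero.shift()).cumsum()

# Split each side into its per-run dataframes.
zero_frames = [g for _, g in df[is_zero].groupby(group_id[is_zero])]
nonzero_frames = [g for _, g in df[~is_zero].groupby(group_id[~is_zero])]
```

This avoids materializing the Python-level label list, so it should scale better to millions of rows, though I have not benchmarked it against the reduce/groupby version.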
As mentioned, when groupby(df.Val, key=lambda x: x > 0) is called while df.Val contains null values, the condition is not met, because NaN > 0 returns False. The grouping therefore becomes faulty, leading to unexpected output. Since the goal is to distinguish between values equal to 0 and those that are not, you can change the function passed to the key parameter and group with groupby(df.Val, key=lambda x: x == 0). The following is almost the same as above, the only difference being == instead of >:
d = {'Time': [Timestamp('2017-06-29 17:48:00'),
              Timestamp('2017-06-29 17:49:00'),
              Timestamp('2017-06-29 17:50:00'),
              Timestamp('2017-06-29 17:51:00'),
              Timestamp('2017-06-29 17:52:00'),
              Timestamp('2017-06-29 17:53:00'),
              Timestamp('2017-06-29 17:54:00'),
              Timestamp('2017-06-29 17:55:00'),
              Timestamp('2017-06-29 17:56:00'),
              Timestamp('2017-06-29 17:57:00'),
              Timestamp('2017-06-29 17:58:00'),
              Timestamp('2017-06-29 17:59:00')],
     'Val': [0, 0, 1, 2, 3, 0, None, 0, 0, 0, 0, None]}
df = pd.DataFrame(d)
# Same run-length encoding as before, but keyed on equality with 0 so that
# NaN rows (for which x == 0 is False) fall into the nonzero bin.
df['grouper'] = reduce(list.__add__, ([x]*len(list(y[1])) for x, y in enumerate(groupby(df.Val, key=lambda x: x == 0))))
bins = [[], []]
for _, frame in df.groupby('grouper'):
    if (frame.Val == 0).all():
        bins[0].append(frame.iloc[:, :-1])
    else:
        bins[1].append(frame.iloc[:, :-1])
# [[ Time Val
# 0 2017-06-29 17:48:00 0.0
# 1 2017-06-29 17:49:00 0.0, Time Val
# 5 2017-06-29 17:53:00 0.0, Time Val
# 7 2017-06-29 17:55:00 0.0
# 8 2017-06-29 17:56:00 0.0
# 9 2017-06-29 17:57:00 0.0
# 10 2017-06-29 17:58:00 0.0], [ Time Val
# 2 2017-06-29 17:50:00 1.0
# 3 2017-06-29 17:51:00 2.0
# 4 2017-06-29 17:52:00 3.0, Time Val
# 6 2017-06-29 17:54:00 NaN, Time Val
# 11 2017-06-29 17:59:00 NaN]]
I hope this helps.
Upvotes: 1