JakeCowton

Reputation: 1445

Conditional split of pandas DataFrame

Given a DataFrame that looks like this

Index   Time               Val 
1       2017-06-29 17:48    0
2       2017-06-29 17:49    0
3       2017-06-29 17:50    1
4       2017-06-29 17:51    2
5       2017-06-29 17:52    3
6       2017-06-29 17:53    0
7       2017-06-29 17:54    0
8       2017-06-29 17:55    0
9       2017-06-29 17:56    0
10      2017-06-29 17:57    0

How can I split it into two lists of dataframes, where one list holds the blocks where Val == 0 and the other holds the blocks where Val > 0 (Val < 0 does not occur)? The complication is that the blocks must stay contiguous: when Val > 0, the next row where Val == 0 should start a new dataframe, and vice versa.

So, based on the dataframe above, the first list would contain two dataframes: one with indexes 1-2 and one with indexes 6-10. The second list would contain a single dataframe with indexes 3-5.

I know I can strip out the Val > 0 rows with df[df.Val == 0], which would give the dataframe below, but because of the break between indexes 2 and 6, I need those rows in separate dataframes.

Index   Time               Val 
1       2017-06-29 17:48    0
2       2017-06-29 17:49    0
6       2017-06-29 17:53    0
7       2017-06-29 17:54    0
8       2017-06-29 17:55    0
9       2017-06-29 17:56    0
10      2017-06-29 17:57    0

N.B. This needs to scale to a large dataframe (millions of rows), so speed matters; iterating over each row and looking for a break in the indexes (or timestamps) is something I'd rather avoid.

Upvotes: 1

Views: 1142

Answers (1)

Abdou

Reputation: 13274

I cannot guarantee that the following will run fast, but it should get you pretty far. The idea is to apply what is known as run-length encoding to the Val column to generate a new column by which to group your dataframe. The following should serve as a decent start:

import pandas as pd
from pandas import Timestamp
from itertools import groupby
from functools import reduce


d = {'Time': [Timestamp('2017-06-29 17:48:00'),
  Timestamp('2017-06-29 17:49:00'),
  Timestamp('2017-06-29 17:50:00'),
  Timestamp('2017-06-29 17:51:00'),
  Timestamp('2017-06-29 17:52:00'),
  Timestamp('2017-06-29 17:53:00'),
  Timestamp('2017-06-29 17:54:00'),
  Timestamp('2017-06-29 17:55:00'),
  Timestamp('2017-06-29 17:56:00'),
  Timestamp('2017-06-29 17:57:00')],
 'Val': [0, 0, 1, 2, 3, 0, 0, 0, 0, 0]}

df = pd.DataFrame(d)

# Run-length encode Val: enumerate each run of consecutive rows that share
# the same (Val > 0) flag, repeating the run's index once per row in it
df['grouper'] = reduce(list.__add__, ([x]*len(list(y[1])) for x, y in enumerate(groupby(df.Val, key=lambda x: x > 0))))

# Sort each run into a bin, dropping the helper grouper column as we go
bins = [[], []]
for _, frame in df.groupby('grouper'):
    if (frame.Val == 0).all():
        bins[0].append(frame.iloc[:, :-1])
    else:
        bins[1].append(frame.iloc[:, :-1])


print(bins)

Should yield the following list:

# [[                 Time  Val
# 0 2017-06-29 17:48:00    0
# 1 2017-06-29 17:49:00    0,                  Time  Val
# 5 2017-06-29 17:53:00    0
# 6 2017-06-29 17:54:00    0
# 7 2017-06-29 17:55:00    0
# 8 2017-06-29 17:56:00    0
# 9 2017-06-29 17:57:00    0], [                 Time  Val
# 2 2017-06-29 17:50:00    1
# 3 2017-06-29 17:51:00    2
# 4 2017-06-29 17:52:00    3]]

The idea here is that you are applying run-length encoding to the Val column: you find the runs of consecutive rows that fall on the same side of the Val > 0 test and give each run its own id. That id is saved in a new column called grouper, and it is this column that is used to group the initial dataframe. Once the grouping is done, the for-loop places the separate dataframes into the bins list.
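
For the sample data, the intermediate values work out like this (shown as an illustration; this table is not part of the original code):

# Val:      0  0  1  2  3  0  0  0  0  0
# Val > 0:  F  F  T  T  T  F  F  F  F  F
# grouper:  0  0  1  1  1  2  2  2  2  2

Each run of identical flags gets one id: rows 0-1 form group 0, rows 2-4 form group 1, and rows 5-9 form group 2.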

While I cannot vouch for the speed, I believe this idea gets you your desired output very easily. You could try to implement the run-length encoding idea with numpy for a little bit of speed.
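
As a rough sketch of that numpy suggestion (a possible speed-up, not benchmarked here), the same grouper ids can be built without a Python-level loop: a new run starts wherever the Val > 0 flag differs from the previous row's, so a cumulative sum of the flips numbers the runs.

import numpy as np

# A run boundary occurs wherever the (Val > 0) flag flips between
# consecutive rows; cumulatively summing the flips numbers the runs
is_pos = (df['Val'] > 0).values
df['grouper'] = np.concatenate(([0], np.cumsum(is_pos[1:] != is_pos[:-1])))

This yields the same 0, 0, 1, 1, 1, 2, ... ids as the reduce/groupby version above.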

Edit:

As mentioned, when groupby(df.Val, key=lambda x: x > 0) is called while df.Val contains null values, the grouping becomes faulty: NaN > 0 returns False, so null rows get merged into the runs of zeros, leading to unexpected output. Since the goal is to distinguish values equal to 0 from those that are not, you can change the function passed to the key parameter and group with groupby(df.Val, key=lambda x: x == 0).
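
To see why the new key helps, look at how NaN behaves under both comparisons (a quick illustrative check, not part of the original code):

float('nan') > 0    # False -> with the old key, NaN rows merge into the zero runs
float('nan') == 0   # False -> with the new key, NaN rows merge into the non-zero runs

With x == 0 as the key, a null value can no longer be mistaken for a zero. The following is almost the same as the code above, the only difference being == instead of >: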

d = {'Time': [Timestamp('2017-06-29 17:48:00'),
  Timestamp('2017-06-29 17:49:00'),
  Timestamp('2017-06-29 17:50:00'),
  Timestamp('2017-06-29 17:51:00'),
  Timestamp('2017-06-29 17:52:00'),
  Timestamp('2017-06-29 17:53:00'),
  Timestamp('2017-06-29 17:54:00'),
  Timestamp('2017-06-29 17:55:00'),
  Timestamp('2017-06-29 17:56:00'),
  Timestamp('2017-06-29 17:57:00'),
  Timestamp('2017-06-29 17:58:00'),
  Timestamp('2017-06-29 17:59:00')],
 'Val': [0, 0, 1, 2, 3, 0, None, 0, 0, 0, 0, None]}

df = pd.DataFrame(d)


# Same run-length encoding as before, now keyed on x == 0 so that null
# values fall on the same side as the other non-zero rows
df['grouper'] = reduce(list.__add__, ([x]*len(list(y[1])) for x, y in enumerate(groupby(df.Val, key=lambda x: x == 0))))

# Sort runs into bins as before; a run containing NaN fails the
# (frame.Val == 0).all() check and lands in bins[1]
bins = [[], []]
for _, frame in df.groupby('grouper'):
    if (frame.Val == 0).all():
        bins[0].append(frame.iloc[:, :-1])
    else:
        bins[1].append(frame.iloc[:, :-1])

# [[                 Time  Val
# 0 2017-06-29 17:48:00  0.0
# 1 2017-06-29 17:49:00  0.0,                  Time  Val
# 5 2017-06-29 17:53:00  0.0,                   Time  Val
# 7  2017-06-29 17:55:00  0.0
# 8  2017-06-29 17:56:00  0.0
# 9  2017-06-29 17:57:00  0.0
# 10 2017-06-29 17:58:00  0.0], [                 Time  Val
# 2 2017-06-29 17:50:00  1.0
# 3 2017-06-29 17:51:00  2.0
# 4 2017-06-29 17:52:00  3.0,                  Time  Val
# 6 2017-06-29 17:54:00  NaN,                   Time  Val
# 11 2017-06-29 17:59:00  NaN]]

I hope this helps.

Upvotes: 1
