pierre_j

Reputation: 983

Efficient groupby when rows of groups are contiguous?

The context

I am looking to apply a ufunc (cumsum in this case) to blocks of contiguous rows in a time series stored in a pandas DataFrame. The time series is sorted according to its DatetimeIndex.

Blocks are defined by a custom DatetimeIndex.

To do so, I came up with the following (working) code.

import random
import pandas as pd

# input dataset
length = 10
ts = pd.date_range(start='2021/01/01 00:00', periods=length, freq='1h')
random.seed(1)
val = random.sample(range(1, 10+length), length)
df = pd.DataFrame({'val' : val}, index=ts)

# groupby custom datetimeindex
key_ts = [ts[i] for i in [1,3,7]]
df.loc[key_ts, 'id'] = range(len(key_ts))
df['id'] = df['id'].ffill()

# cumsum
df['cumsum'] = df.groupby('id')['val'].cumsum()
# initial dataset
In [13]: df
Out[13]: 
                     val
2021-01-01 00:00:00    5
2021-01-01 01:00:00    3
2021-01-01 02:00:00    9
2021-01-01 03:00:00    4
2021-01-01 04:00:00    8
2021-01-01 05:00:00   13
2021-01-01 06:00:00   15
2021-01-01 07:00:00   14
2021-01-01 08:00:00   11
2021-01-01 09:00:00    7
# DatetimeIndex defining custom time intervals for 'resampling'.
In [14]: key_ts
Out[14]: 
[Timestamp('2021-01-01 01:00:00', freq='H'),
 Timestamp('2021-01-01 03:00:00', freq='H'),
 Timestamp('2021-01-01 07:00:00', freq='H')]
# result
In [16]: df
Out[16]: 
                     val   id  cumsum
2021-01-01 00:00:00    5  NaN      -1
2021-01-01 01:00:00    3  0.0       3
2021-01-01 02:00:00    9  0.0      12
2021-01-01 03:00:00    4  1.0       4
2021-01-01 04:00:00    8  1.0      12
2021-01-01 05:00:00   13  1.0      25
2021-01-01 06:00:00   15  1.0      40
2021-01-01 07:00:00   14  2.0      14
2021-01-01 08:00:00   11  2.0      25
2021-01-01 09:00:00    7  2.0      32

The question

Is groupby the most efficient approach in terms of CPU and memory in this case, where blocks are made of contiguous rows? I would think that with groupby, a first pass over the full dataset is made to identify all the rows that belong together.

Since rows are contiguous in my case, I don't need to read the full dataset to know I have gathered all the rows of the current group: as soon as I hit the first row of the next group, I know the calculations for the previous group are done.

When rows are contiguous, the sorting step is also lighter.

Hence the question: is there a way to tell pandas about this to save some CPU?
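For illustration only, here is the single-pass logic I have in mind, as a rough plain-Python sketch (the NaN ids are not handled specially here):

# single-pass idea: a change of 'id' closes the previous group
running = 0
prev_id = None
out = []
for cur_id, v in zip(df['id'], df['val']):
    if cur_id != prev_id:      # a new group starts here
        running = 0
        prev_id = cur_id
    running += v
    out.append(running)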

Thanks in advance for your feedback.

Upvotes: 1

Views: 160

Answers (1)

Jérôme Richard

Reputation: 50308

groupby is clearly not the fastest solution here because it has to use either a slow sort or slow hashing operations to group the values.

What you want to implement is called a segmented cumulative sum. You can implement it quite efficiently using Numpy, but this is a bit tricky (especially because of the NaN values) and not the fastest option, since it needs multiple passes over the id/val columns. The fastest solution is to use something like Numba to do this very quickly in one pass.

Here is an implementation:

import numpy as np
import numba as nb

# To avoid the compilation cost at runtime, use: 
# @nb.njit('int64[:](float64[:],int64[:])')
@nb.njit
def segmentedCumSum(ids, values):
    size = len(ids)
    res = np.empty(size, dtype=values.dtype)
    if size == 0:
        return res
    zero = values.dtype.type(0)
    curValue = zero
    for i in range(size):
        if not np.isnan(ids[i]):
            # reset the running sum when a new id block starts
            if i > 0 and ids[i-1] != ids[i]:
                curValue = zero
            curValue += values[i]
            res[i] = curValue
        else:
            # row without an id: mark it and reset the running sum
            res[i] = -1
            curValue = zero
    return res

df['cumsum'] = segmentedCumSum(df['id'].to_numpy(), df['val'].to_numpy())
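For comparison, a Numpy-only version could look roughly like this (a sketch, assuming the id column has been forward-filled as in the question; the function name is just illustrative):

def segmented_cumsum_np(ids, values):
    # global cumulative sum, then subtract the total accumulated
    # just before the start of each contiguous block of identical ids
    total = np.cumsum(values)
    is_start = np.ones(len(ids), dtype=bool)
    is_start[1:] = ids[1:] != ids[:-1]    # a block starts where the id changes
    block = np.cumsum(is_start) - 1       # block index of every row
    offsets = (total - values)[is_start]  # total accumulated before each block
    res = total - offsets[block]
    res[np.isnan(ids)] = -1               # rows without an id, as in the question
    return res

df['cumsum'] = segmented_cumsum_np(df['id'].to_numpy(), df['val'].to_numpy())

It needs several temporary arrays and several passes over the data, which is why the compiled single-pass loop above is faster.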

Note that ids[i-1] != ids[i] might fail with big floats because of their limited precision. The best solution is to use integer ids, with -1 replacing the NaN values. If you do want to keep float ids, you can use the expression np.abs(ids[i-1]-ids[i]) > epsilon with a very small epsilon. See this for more information.
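For example, integer ids could be built from the question's setup roughly like this (a sketch; with integer ids, the np.isnan test in segmentedCumSum would become a test against -1):

# integer block ids, with -1 instead of NaN for rows before the first key
ids = np.full(len(df), -1, dtype=np.int64)
ids[df.index.get_indexer(key_ts)] = np.arange(len(key_ts))
ids = np.maximum.accumulate(ids)   # forward-fill the block ids over contiguous rows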

Upvotes: 1
