Reputation: 115
For each row, I want to find the length of the longest consecutive run of that row's maximum value.
PS1: My data has 500K rows, so I'm concerned about calculation speed.
PS2: In this example startDay=1 and endDay=7, but some rows have a different start or end day (such as startDay=2, endDay=5, or startDay=4, endDay=3; arr_bool handles these conditions).
My data:
import pandas as pd
import numpy as np
idx = ['id1', 'id2', 'id3', 'id4', 'id5',
'id6', 'id7', 'id8', 'id9', 'id10']
data = {'Day1':[0,0,1,0,1,1,0,0,1,1],
'Day2':[0,1,1,1,2,1,0,1,1,2],
'Day3':[1,3,1,1,1,0,0,1,3,2],
'Day4':[1,2,0,1,1,0,0,2,1,1],
'Day5':[0,2,1,1,1,1,0,2,1,1],
'Day6':[1,0,1,1,2,1,0,2,1,1],
'Day7':[0,0,0,1,1,1,0,0,3,1]}
startday = pd.DataFrame([1,1,1,1,1,1,1,1,1,1],columns=['start'], index=idx)
endday = pd.DataFrame([7,7,7,7,7,7,7,7,7,7],columns=['end'], index=idx)
df = pd.DataFrame(data, index=idx)
Neg99 = -999
Neg90 = -900
I need to search the time interval of every row (like a loop from startday to endday). I can find the max value within the interval, but I couldn't work out how long the longest consecutive run of that max value is.
arr_bool = (np.less_equal.outer(startday.start, range(1,8))
& np.greater_equal.outer(endday.end, range(1,8))
)
df_result = pd.DataFrame(df.mask(~arr_bool).max(axis=1),
index=idx, columns=['result'])
Finally, the edge-case conditions:
df_result.result= np.select( condlist = [startday.start > endday.end,
~arr_bool.any(axis=1)],
choicelist = [Neg99,Neg90],
default = df_result.result)
The result I want:
result_i_want = pd.DataFrame([2,1,3,6,1,3,0,3,1,2],columns=['result'], index=idx)
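In other words, for a single row the target quantity is the length of the longest consecutive run of that row's maximum value (ignoring the start/end-day window for a moment). A minimal single-row sketch, using a hypothetical helper name `longest_max_run`:

```python
import numpy as np

def longest_max_run(row):
    """Length of the longest consecutive run of the row's maximum value.

    Illustrative single-row sketch only; it ignores the start/end-day
    window handled by arr_bool in the question.
    """
    row = np.asarray(row)
    m = row.max()
    if m == 0:
        return 0
    best = cur = 0
    for v in row:
        cur = cur + 1 if v == m else 0
        best = max(best, cur)
    return best

print(longest_max_run([0, 0, 1, 1, 0, 1, 0]))  # row id1 -> 2
print(longest_max_run([0, 1, 1, 1, 1, 1, 1]))  # row id4 -> 6
```

For row id1 the maximum is 1 and its runs have lengths 2 and 1, hence the expected result of 2.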
Here is @WeNYoBen's solution, but it runs slowly:
s = df.eq(df.max(1), axis=0) & df.ne(0)
s.apply(lambda x: x[x].groupby((~x).cumsum()).count().max(), axis=1).fillna(0)
Upvotes: 3
Views: 167
Reputation: 53029
Here is another numpy solution. First, timings compared to @piRSquared's answer for reference. My code is ~14x faster on a large example while giving exactly the same result.
# both methods give the expected result on small OP example
result
id1 2
id2 1
id3 3
id4 6
id5 1
id6 3
id7 0
id8 3
id9 1
id10 2
result
id1 2
id2 1
id3 3
id4 6
id5 1
id6 3
id7 0
id8 3
id9 1
id10 2
# timings on 50,000 rows random example
pp 12.89263810031116
pi 189.0821446024347
# comparison of results
result True
dtype: bool
Code:
import pandas as pd
import numpy as np
# OP example
idx = ['id1', 'id2', 'id3', 'id4', 'id5',
'id6', 'id7', 'id8', 'id9', 'id10']
data = {'Day1':[0,0,1,0,1,1,0,0,1,1],
'Day2':[0,1,1,1,2,1,0,1,1,2],
'Day3':[1,3,1,1,1,0,0,1,3,2],
'Day4':[1,2,0,1,1,0,0,2,1,1],
'Day5':[0,2,1,1,1,1,0,2,1,1],
'Day6':[1,0,1,1,2,1,0,2,1,1],
'Day7':[0,0,0,1,1,1,0,0,3,1]}
startday = pd.DataFrame([1,1,1,1,1,1,1,1,1,1],columns=['start'], index=idx)
endday = pd.DataFrame([7,7,7,7,7,7,7,7,7,7],columns=['end'], index=idx)
df = pd.DataFrame(data, index=idx)
Neg99 = -999
Neg90 = -900
# large example
IDX = [f'id{i}' for i in range(1,50_001)]
STARTDAY, ENDDAY = (pd.DataFrame({c:l}, index=IDX) for c,l in zip(('start','end'), np.sort(np.random.randint(1,8,(2,50_000)), axis=0)))
DF = pd.DataFrame({f'Day{i}':l for i,l in enumerate(np.random.randint(0,4,(7, 50_000)), 1)}, index=IDX)
def pp():
if restrict_max:
data = np.where((startday.values<=np.arange(1,8)) & (endday.values>=np.arange(1,8)), df.values, 0)
mask = data==np.maximum((data==0).all(1), data.max(1))[:, None]
else:
mask = (df.values==np.maximum((df.values==0).all(1), df.values.max(1))[:, None]) & (startday.values<=np.arange(1,8)) & (endday.values>=np.arange(1,8))
y, x = np.where(np.diff(mask, axis=1, prepend=False, append=False))
y = y[::2]
x = x[1::2]-x[::2]
res = np.zeros(df.values.shape[:1], int)
nl = np.flatnonzero(np.diff(y, prepend=-1))
res[y[nl]] = np.maximum.reduceat(x, nl)
return pd.DataFrame({'result': res}, index=df.index)
def pi():
sd = startday.start.values
ed = endday.end.values
dr = ed - sd + 1
i = np.arange(len(df)).repeat(dr)
j = np.concatenate([np.arange(s - 1, e) for s, e in zip(sd, ed)])
v = df.values
mx = np.empty(len(v), dtype=v.dtype)
mx.fill(v.min())
np.maximum.at(mx, i, v[i, j])
b = np.ones((v.shape[0], v.shape[1] + 2), bool)
b[i, j + 1] = (v[i, j] != mx[i]) | (mx[i] == 0)
x, y = np.where(b)
y_ = np.diff(y)
mask = y_ > 0
y__ = y_[mask]
x__ = x[1:][mask]
c = np.empty(len(v), int)
c.fill(y__.min())
np.maximum.at(c, x__, y__)
return pd.DataFrame({'result': c - 1}, index=df.index)
restrict_max=True
print(pp())
print(pi())
df, startday, endday = DF, STARTDAY, ENDDAY
from timeit import timeit
print('pp', timeit(pp,number=10)*100)
print('pi', timeit(pi,number=10)*100)
print((pp()==pi()).all())
Upvotes: 2
Reputation: 30971
Try the following solution:
Start from defining a function to be applied to each row:
def fn(row):
sd = startday.loc[row.name, 'start'] - 1
ed = endday.loc[row.name, 'end']
rr = row.values[sd:ed]
vMax = rr.max()
if vMax > 0:
gr = itertools.groupby(rr)
return max([ len(list(grp)) for key, grp in gr if key == vMax ])
else:
return 0
The key difference between itertools.groupby (used above) and the pandas groupby is that the itertools version starts a new group on each change of the source value, so each group contains a run of identical consecutive values (from the current row).
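To make that concrete, here is a quick illustration (with a made-up row) of how itertools.groupby splits a sequence into runs, and how the group lengths then yield the answer:

```python
import itertools

row = [1, 1, 0, 1, 1, 1, 0]  # one made-up row of daily values
# each group is a maximal run of identical consecutive values
runs = [(key, len(list(grp))) for key, grp in itertools.groupby(row)]
print(runs)  # [(1, 2), (0, 1), (1, 3), (0, 1)]

# keep only the runs of the row maximum and take the longest one
v_max = max(row)
print(max(length for key, length in runs if key == v_max))  # 3
```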
The first stage is to take the proper slice of the current row: sd is the starting index (inclusive), ed is the ending index (exclusive), and rr is the resulting slice (further referred to as the row).
If the max value in the current row is > 0, the list comprehension used above gathers the lengths of all groups whose key equals vMax. The value returned by the function is the maximum of this list, i.e. the length of the longest sequence of maximal values.
If the current row contains only zeroes (max == 0), the function returns 0.
Then, the only thing to do is to apply the above function to each row:
df['result'] = df.apply(fn, axis=1)
Of course, you have to import itertools.
The advantage of my solution over the other answer is that it is significantly shorter.
Upvotes: 0
Reputation: 294218
The point of this effort is that OP asked for speed. This should help. If you have access to a JIT library like numba, you should use that and just loop over each row.
sd = startday.start.values
ed = endday.end.values
dr = ed - sd + 1
i = np.arange(len(df)).repeat(dr)
j = np.concatenate([np.arange(s - 1, e) for s, e in zip(sd, ed)])
v = df.values
mx = np.empty(len(v), dtype=v.dtype)
mx.fill(v.min())
np.maximum.at(mx, i, v[i, j])
b = np.ones((v.shape[0], v.shape[1] + 2), bool)
b[i, j + 1] = (v[i, j] != mx[i]) | (mx[i] == 0)
x, y = np.where(b)
y_ = np.diff(y)
mask = y_ > 0
y__ = y_[mask]
x__ = x[1:][mask]
c = np.empty(len(v), int)
c.fill(y__.min())
np.maximum.at(c, x__, y__)
c - 1
array([2, 1, 3, 6, 1, 3, 0, 3, 1, 2])
I'll leave the obvious alone.
This represents the number of days in each interval:
dr = ed - sd + 1
i holds the flattened relevant row indices for the corresponding flattened column indices in j:
i = np.arange(len(df)).repeat(dr)
j = np.concatenate([np.arange(s - 1, e) for s, e in zip(sd, ed)])
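A tiny two-row example (with made-up start and end days) shows what repeat and concatenate produce here:

```python
import numpy as np

sd = np.array([1, 3])  # start days, 1-based (made-up values)
ed = np.array([3, 4])  # end days, inclusive
dr = ed - sd + 1       # days per interval: [3, 2]

i = np.arange(len(sd)).repeat(dr)
j = np.concatenate([np.arange(s - 1, e) for s, e in zip(sd, ed)])
print(i)  # [0 0 0 1 1] -> row index of every in-window cell
print(j)  # [0 1 2 2 3] -> matching 0-based column index
```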
mx will be the maximum value for each interval.
b will be a boolean array two columns wider than v. For this case it looks like:
# Buffer Buffer
# /--\ /--\
array([[ True, True, True, False, False, True, False, True, True],
[ True, True, True, False, True, True, True, True, True],
[ True, False, False, False, True, False, False, True, True],
[ True, True, False, False, False, False, False, False, True],
[ True, True, False, True, True, True, False, True, True],
[ True, False, False, True, True, False, False, False, True],
[ True, False, False, False, False, False, False, False, True],
[ True, True, True, True, False, False, False, True, True],
[ True, True, True, False, True, True, True, False, True],
[ True, True, False, False, True, True, True, True, True]])
The reason for the buffer columns is that I can calculate differences of positions after using np.where.
Now I populate b where the v values are not equal to the maximum values in mx:
# not equal to max is equal to zero
b[i, j + 1] = (v[i, j] != mx[i]) | (mx[i] == 0)
Then I find where those positions are in y.
By taking the diff, I find the number of positions from one instance of not-equal-to-max to the next such instance. This will always be one greater than the number we're looking for, but we'll correct that later.
Also, the diff reduces the length by one, but in reality there's a bunch of stuff we don't need, because I don't need the difference of one row relative to the previous row. Fortunately, I can get rid of all zero or negative differences, because they don't make sense.
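To see the diff-of-positions trick in isolation, here is the same idea applied to a single padded row (the actual code does this for all rows at once through x and y):

```python
import numpy as np

v = np.array([0, 0, 1, 1, 0, 1, 0])  # row id1, whose max is 1
# True wherever a run of maxima is broken, plus one buffer break at each end
breaks = np.ones(len(v) + 2, bool)
breaks[1:-1] = v != v.max()
pos = np.flatnonzero(breaks)  # positions of the breaks
runs = np.diff(pos)           # distance between breaks = run length + 1
print(runs - 1)               # [0 0 2 1 0] -> longest run is 2
```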
I use np.maximum.at (again), but this time on the differences, to find the largest difference; that is the longest run of consecutive max values for each row.
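np.maximum.at is the unbuffered ufunc form, so repeated row indices keep accumulating the maximum, whereas a plain fancy-indexed assignment like c[x__] = y__ would only keep the last value written per row. A small made-up illustration:

```python
import numpy as np

c = np.zeros(3, int)
rows = np.array([0, 0, 2, 2, 2])  # row each candidate difference belongs to
vals = np.array([2, 5, 1, 4, 3])  # candidate run lengths (made-up values)
np.maximum.at(c, rows, vals)      # per-row maximum despite repeated indices
print(c)  # [5 0 4]
```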
Mind that it's actually one more than that, which is why the final result is c - 1.
Phew. I'm tired of typing...
Upvotes: 4