Reputation: 115
For each row, I want to find the length of the longest consecutive run of that row's maximum value.
PS1: My data has 500K rows, so I'm concerned about calculation speed.
PS2: In this example startDay=1 and endDay=7, but some rows have a different start or end day (such as startDay=2, endDay=5, or startDay=4, endDay=3; arr_bool handles these conditions).
My data:
import pandas as pd
import numpy as np
idx = ['id1', 'id2', 'id3', 'id4', 'id5',
'id6', 'id7', 'id8', 'id9', 'id10']
data = {'Day1':[0,0,1,0,1,1,0,0,1,1],
'Day2':[0,1,1,1,2,1,0,1,1,2],
'Day3':[1,3,1,1,1,0,0,1,3,2],
'Day4':[1,2,0,1,1,0,0,2,1,1],
'Day5':[0,2,1,1,1,1,0,2,1,1],
'Day6':[1,0,1,1,2,1,0,2,1,1],
'Day7':[0,0,0,1,1,1,0,0,3,1]}
startday = pd.DataFrame([1,1,1,1,1,1,1,1,1,1],columns=['start'], index=idx)
endday = pd.DataFrame([7,7,7,7,7,7,7,7,7,7],columns=['end'], index=idx)
df = pd.DataFrame(data, index=idx)
Neg99 = -999
Neg90 = -900
I need to search the time interval of every row (like a loop from startday to endday). I can find the max value within the interval, but I couldn't work out how long the longest consecutive run of that max value is.
arr_bool = (np.less_equal.outer(startday.start, range(1,8))
& np.greater_equal.outer(endday.end, range(1,8))
)
df_result = pd.DataFrame(df.mask(~arr_bool).max(axis=1),
index=idx, columns=['result'])
Finally, the edge-case conditions:
df_result.result= np.select( condlist = [startday.start > endday.end,
~arr_bool.any(axis=1)],
choicelist = [Neg99,Neg90],
default = df_result.result)
The result I want:
result_i_want = pd.DataFrame([2,1,3,6,1,3,0,3,1,2],columns=['result'], index=idx)
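In other words, for a single row the target quantity is the length of the longest consecutive run of that row's maximum value (ignoring the start/end-day window for a moment). A minimal single-row sketch, using a hypothetical helper name `longest_max_run`:

```python
import numpy as np

def longest_max_run(row):
    """Length of the longest consecutive run of the row's maximum value.

    Illustrative single-row sketch only; it ignores the start/end-day
    window handled by arr_bool in the question.
    """
    row = np.asarray(row)
    m = row.max()
    if m == 0:
        return 0
    best = cur = 0
    for v in row:
        cur = cur + 1 if v == m else 0
        best = max(best, cur)
    return best

print(longest_max_run([0, 0, 1, 1, 0, 1, 0]))  # row id1 -> 2
print(longest_max_run([0, 1, 1, 1, 1, 1, 1]))  # row id4 -> 6
```

For row id1 the maximum is 1 and its runs have lengths 2 and 1, hence the expected result of 2.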
Here is @WeNYoBen's solution, but it runs slowly:
s = df.eq(df.max(1), axis=0) & df.ne(0)
s.apply(lambda x: x[x].groupby((~x).cumsum()).count().max(), axis=1).fillna(0)
Upvotes: 3
Views: 167
Reputation: 53029
Here is another numpy solution. First, timings compared to @piRSquared's answer for reference. My code is ~14x faster on a large example while giving exactly the same result.
# both methods give the expected result on small OP example
result
id1 2
id2 1
id3 3
id4 6
id5 1
id6 3
id7 0
id8 3
id9 1
id10 2
result
id1 2
id2 1
id3 3
id4 6
id5 1
id6 3
id7 0
id8 3
id9 1
id10 2
# timings on 50,000 rows random example
pp 12.89263810031116
pi 189.0821446024347
# comparison of results
result True
dtype: bool
Code:
import pandas as pd
import numpy as np
# OP example
idx = ['id1', 'id2', 'id3', 'id4', 'id5',
'id6', 'id7', 'id8', 'id9', 'id10']
data = {'Day1':[0,0,1,0,1,1,0,0,1,1],
'Day2':[0,1,1,1,2,1,0,1,1,2],
'Day3':[1,3,1,1,1,0,0,1,3,2],
'Day4':[1,2,0,1,1,0,0,2,1,1],
'Day5':[0,2,1,1,1,1,0,2,1,1],
'Day6':[1,0,1,1,2,1,0,2,1,1],
'Day7':[0,0,0,1,1,1,0,0,3,1]}
startday = pd.DataFrame([1,1,1,1,1,1,1,1,1,1],columns=['start'], index=idx)
endday = pd.DataFrame([7,7,7,7,7,7,7,7,7,7],columns=['end'], index=idx)
df = pd.DataFrame(data, index=idx)
Neg99 = -999
Neg90 = -900
# large example
IDX = [f'id{i}' for i in range(1,50_001)]
STARTDAY, ENDDAY = (pd.DataFrame({c:l}, index=IDX) for c,l in zip(('start','end'), np.sort(np.random.randint(1,8,(2,50_000)), axis=0)))
DF = pd.DataFrame({f'Day{i}':l for i,l in enumerate(np.random.randint(0,4,(7, 50_000)), 1)}, index=IDX)
def pp():
if restrict_max:
data = np.where((startday.values<=np.arange(1,8)) & (endday.values>=np.arange(1,8)), df.values, 0)
mask = data==np.maximum((data==0).all(1), data.max(1))[:, None]
else:
mask = (df.values==np.maximum((df.values==0).all(1), df.values.max(1))[:, None]) & (startday.values<=np.arange(1,8)) & (endday.values>=np.arange(1,8))
y, x = np.where(np.diff(mask, axis=1, prepend=False, append=False))
y = y[::2]
x = x[1::2]-x[::2]
res = np.zeros(df.values.shape[:1], int)
nl = np.flatnonzero(np.diff(y, prepend=-1))
res[y[nl]] = np.maximum.reduceat(x, nl)
return pd.DataFrame({'result': res}, index=df.index)
def pi():
sd = startday.start.values
ed = endday.end.values
dr = ed - sd + 1
i = np.arange(len(df)).repeat(dr)
j = np.concatenate([np.arange(s - 1, e) for s, e in zip(sd, ed)])
v = df.values
mx = np.empty(len(v), dtype=v.dtype)
mx.fill(v.min())
np.maximum.at(mx, i, v[i, j])
b = np.ones((v.shape[0], v.shape[1] + 2), bool)
b[i, j + 1] = (v[i, j] != mx[i]) | (mx[i] == 0)
x, y = np.where(b)
y_ = np.diff(y)
mask = y_ > 0
y__ = y_[mask]
x__ = x[1:][mask]
c = np.empty(len(v), int)
c.fill(y__.min())
np.maximum.at(c, x__, y__)
return pd.DataFrame({'result': c - 1}, index=df.index)
restrict_max=True
print(pp())
print(pi())
df, startday, endday = DF, STARTDAY, ENDDAY
from timeit import timeit
print('pp', timeit(pp,number=10)*100)
print('pi', timeit(pi,number=10)*100)
print((pp()==pi()).all())
Upvotes: 2
Reputation: 30971
Try the following solution:
Start from defining a function to be applied to each row:
def fn(row):
sd = startday.loc[row.name, 'start'] - 1
ed = endday.loc[row.name, 'end']
rr = row.values[sd:ed]
vMax = rr.max()
if vMax > 0:
gr = itertools.groupby(rr)
return max([ len(list(grp)) for key, grp in gr if key == vMax ])
else:
return 0
The key difference between itertools.groupby (used above) and the pandas groupby is that the itertools version starts a new group on each change of the source value, so each group contains a run of identical consecutive values (from the current row).
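To make that concrete, here is a quick illustration (with a made-up row) of how itertools.groupby splits a sequence into runs, and how the group lengths then yield the answer:

```python
import itertools

row = [1, 1, 0, 1, 1, 1, 0]  # one made-up row of daily values
# each group is a maximal run of identical consecutive values
runs = [(key, len(list(grp))) for key, grp in itertools.groupby(row)]
print(runs)  # [(1, 2), (0, 1), (1, 3), (0, 1)]

# keep only the runs of the row maximum and take the longest one
v_max = max(row)
print(max(length for key, length in runs if key == v_max))  # 3
```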
The first stage is to take the proper slice of the current row: sd is the starting index (inclusive), ed is the ending index (exclusive), and rr is the resulting slice (further referred to as the row).
If the max value in the current row is > 0, the list comprehension used above gathers the lengths of all groups whose key equals vMax. The value returned by the function is the maximum of this list, i.e. the length of the longest sequence of maximal values.
If the current row contains only zeroes (max == 0), the function returns 0.
Then, the only thing to do is to apply the above function to each row:
df['result'] = df.apply(fn, axis=1)
Of course, you have to import itertools.
The advantage of my solution over the other answer is that it is significantly shorter.
Upvotes: 0
Reputation: 294218
The point of this effort is that OP asked for speed. This should help. If you have access to a JIT library like numba, you should use that and just loop over each row.
sd = startday.start.values
ed = endday.end.values
dr = ed - sd + 1
i = np.arange(len(df)).repeat(dr)
j = np.concatenate([np.arange(s - 1, e) for s, e in zip(sd, ed)])
v = df.values
mx = np.empty(len(v), dtype=v.dtype)
mx.fill(v.min())
np.maximum.at(mx, i, v[i, j])
b = np.ones((v.shape[0], v.shape[1] + 2), bool)
b[i, j + 1] = (v[i, j] != mx[i]) | (mx[i] == 0)
x, y = np.where(b)
y_ = np.diff(y)
mask = y_ > 0
y__ = y_[mask]
x__ = x[1:][mask]
c = np.empty(len(v), int)
c.fill(y__.min())
np.maximum.at(c, x__, y__)
c - 1
array([2, 1, 3, 6, 1, 3, 0, 3, 1, 2])
I'll leave the obvious alone.
This represents the number of days in each interval:
dr = ed - sd + 1
i holds the flattened relevant row indices for the corresponding flattened column indices in j:
i = np.arange(len(df)).repeat(dr)
j = np.concatenate([np.arange(s - 1, e) for s, e in zip(sd, ed)])
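A tiny two-row example (with made-up start and end days) shows what repeat and concatenate produce here:

```python
import numpy as np

sd = np.array([1, 3])  # start days, 1-based (made-up values)
ed = np.array([3, 4])  # end days, inclusive
dr = ed - sd + 1       # days per interval: [3, 2]

i = np.arange(len(sd)).repeat(dr)
j = np.concatenate([np.arange(s - 1, e) for s, e in zip(sd, ed)])
print(i)  # [0 0 0 1 1] -> row index of every in-window cell
print(j)  # [0 1 2 2 3] -> matching 0-based column index
```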
mx will be the maximum value for each interval.
b will be a boolean array two columns wider than v. For this case it looks like:
# Buffer Buffer
# /--\ /--\
array([[ True, True, True, False, False, True, False, True, True],
[ True, True, True, False, True, True, True, True, True],
[ True, False, False, False, True, False, False, True, True],
[ True, True, False, False, False, False, False, False, True],
[ True, True, False, True, True, True, False, True, True],
[ True, False, False, True, True, False, False, False, True],
[ True, False, False, False, False, False, False, False, True],
[ True, True, True, True, False, False, False, True, True],
[ True, True, True, False, True, True, True, False, True],
[ True, True, False, False, True, True, True, True, True]])
The reason for the buffer columns is that I can calculate differences of positions after using np.where.
Now I populate b where the v values are not equal to the maximum values in mx:
# not equal to max is equal to zero
b[i, j + 1] = (v[i, j] != mx[i]) | (mx[i] == 0)
Then I find where those positions are in y.
By taking the diff, I find the number of positions from one instance of not-equal-to-max to the next such instance. This will always be one greater than the number we're looking for, but we'll correct that later.
Also, the diff reduces the length by one, but in reality there's a bunch of stuff we don't need, because I don't need the difference of one row relative to the previous row. Fortunately, I can get rid of all zero or negative differences, because they don't make sense.
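To see the diff-of-positions trick in isolation, here is the same idea applied to a single padded row (the actual code does this for all rows at once through x and y):

```python
import numpy as np

v = np.array([0, 0, 1, 1, 0, 1, 0])  # row id1, whose max is 1
# True wherever a run of maxima is broken, plus one buffer break at each end
breaks = np.ones(len(v) + 2, bool)
breaks[1:-1] = v != v.max()
pos = np.flatnonzero(breaks)  # positions of the breaks
runs = np.diff(pos)           # distance between breaks = run length + 1
print(runs - 1)               # [0 0 2 1 0] -> longest run is 2
```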
I use np.maximum.at (again), but this time on the differences, to find the largest difference; that is the longest run of consecutive max values for each row.
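np.maximum.at is the unbuffered ufunc form, so repeated row indices keep accumulating the maximum, whereas a plain fancy-indexed assignment like c[x__] = y__ would only keep the last value written per row. A small made-up illustration:

```python
import numpy as np

c = np.zeros(3, int)
rows = np.array([0, 0, 2, 2, 2])  # row each candidate difference belongs to
vals = np.array([2, 5, 1, 4, 3])  # candidate run lengths (made-up values)
np.maximum.at(c, rows, vals)      # per-row maximum despite repeated indices
print(c)  # [5 0 4]
```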
Mind that it's actually one more than that, which is why the final result is c - 1.
Phew. I'm tired of typing...
Upvotes: 4