Reputation: 531
This is the current dataframe:
> ID Date current
> 2001980 10/30/2017 1
> 2001980 10/29/2017 0
> 2001980 10/28/2017 0
> 2001980 10/27/2017 40
> 2001980 10/26/2017 39
> 2001980 10/25/2017 0
> 2001980 10/24/2017 0
> 2001980 10/23/2017 60
> 2001980 10/22/2017 0
> 2001980 10/21/2017 0
> 2002222 10/21/2017 0
> 2002222 10/20/2017 0
> 2002222 10/19/2017 16
> 2002222 10/18/2017 0
> 2002222 10/17/2017 0
> 2002222 10/16/2017 20
> 2002222 10/15/2017 19
> 2002222 10/14/2017 18
Below is the final dataframe. The expected column is what I am trying to get: within each ID, whenever current is 0, expected should be the previous row's expected minus 1, but never less than current; otherwise it is just current.
Thank you so much.
> ID Date current expected
> 2001980 10/30/2017 1 1
> 2001980 10/29/2017 0 0
> 2001980 10/28/2017 0 0
> 2001980 10/27/2017 40 40
> 2001980 10/26/2017 39 39
> 2001980 10/25/2017 0 38
> 2001980 10/24/2017 0 37
> 2001980 10/23/2017 60 60
> 2001980 10/22/2017 0 59
> 2001980 10/21/2017 0 58
> 2002222 10/21/2017 0 0
> 2002222 10/20/2017 0 0
> 2002222 10/19/2017 16 16
> 2002222 10/18/2017 0 15
> 2002222 10/17/2017 0 14
> 2002222 10/16/2017 20 20
> 2002222 10/15/2017 19 19
> 2002222 10/14/2017 18 18
I am currently using Excel with the formula below:
= IF(this row's ID = last row's ID, MAX(last row's expected value - 1, this row's current value), this row's current value)
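A direct Python transcription of that Excel logic would be something like this sketch (assuming the table above is already loaded into a dataframe df, sorted as shown):

expected = []
prev_id, prev_exp = None, None
for row in df.itertuples():
    if row.ID == prev_id:
        # Same ID as the previous row: decrement, but never below current
        expected.append(max(prev_exp - 1, row.current))
    else:
        # New ID: restart from current
        expected.append(row.current)
    prev_id, prev_exp = row.ID, expected[-1]
df['expected'] = expected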
Upvotes: 6
Views: 9245
Reputation: 153460
Revised, simpler version:

import numpy as np

df['expected'] = df.groupby(['ID', df.current.ne(0).cumsum()])['current']\
                   .transform(lambda x: x.eq(0).cumsum().mul(-1).add(x.iloc[0])).clip(0, np.inf)
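If the grouping key is unclear, you can inspect it: a new group starts at every non-zero current, so each (ID, run) group is one non-zero value followed by its trailing zeros:

key = df['current'].ne(0).cumsum()
print(df.groupby(['ID', key])['current'].apply(list))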
Let's have a little fun:
df['expected'] = (df.groupby('ID')['current'].transform(lambda x: x.where(x.ne(0)).ffill()) +
df.groupby(['ID',df.current.ne(0).cumsum()])['current'].transform(lambda x: x.eq(0).cumsum()).mul(-1))\
.clip(0,np.inf).fillna(0).astype(int)
print(df)
Output:
ID Date current expected
0 2001980 10/30/2017 1 1
1 2001980 10/29/2017 0 0
2 2001980 10/28/2017 0 0
3 2001980 10/27/2017 40 40
4 2001980 10/26/2017 39 39
5 2001980 10/25/2017 0 38
6 2001980 10/24/2017 0 37
7 2001980 10/23/2017 60 60
8 2001980 10/22/2017 0 59
9 2001980 10/21/2017 0 58
10 2002222 10/21/2017 0 0
11 2002222 10/20/2017 0 0
12 2002222 10/19/2017 16 16
13 2002222 10/18/2017 0 15
14 2002222 10/17/2017 0 14
15 2002222 10/16/2017 20 20
16 2002222 10/15/2017 19 19
17 2002222 10/14/2017 18 18
# Let's calculate two series. The first fills the zeros within an 'ID' with the previous non-zero value:
s1 = df.groupby('ID')['current'].transform(lambda x: x.where(x.ne(0)).ffill())
s1
Output:
0 1.0
1 1.0
2 1.0
3 40.0
4 39.0
5 39.0
6 39.0
7 60.0
8 60.0
9 60.0
10 NaN
11 NaN
12 16.0
13 16.0
14 16.0
15 20.0
16 19.0
17 18.0
Name: current, dtype: float64
# The second series is a negated cumulative count of zeros within each run, grouped by 'ID':
s2 = df.groupby(['ID',df.current.ne(0).cumsum()])['current'].transform(lambda x: x.eq(0).cumsum()).mul(-1)
s2
Output:
0 0
1 -1
2 -2
3 0
4 0
5 -1
6 -2
7 0
8 -1
9 -2
10 -1
11 -2
12 0
13 -1
14 -2
15 0
16 0
17 0
Name: current, dtype: int32
Finally, add the two series, clip at zero, and fill the NaN values:

(s1 + s2).clip(0, np.inf).fillna(0)
Output:
0 1.0
1 0.0
2 0.0
3 40.0
4 39.0
5 38.0
6 37.0
7 60.0
8 59.0
9 58.0
10 0.0
11 0.0
12 16.0
13 15.0
14 14.0
15 20.0
16 19.0
17 18.0
Name: current, dtype: float64
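Assign this back to the frame, as in the one-liners above:

df['expected'] = (s1 + s2).clip(0, np.inf).fillna(0).astype(int)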
Upvotes: 4
Reputation: 323226
The logic here should work:
lst = []
for _, y in df.groupby('ID'):
    z = []
    for i, (_, x) in enumerate(y.iterrows()):
        if x['current'] > 0:
            z.append(x['current'])
        else:
            try:
                # Decrement the previous value, flooring at 0
                z.append(max(z[i - 1] - 1, 0))
            except IndexError:
                # First row of the group is already 0
                z.append(0)
    lst.extend(z)
lst
Out[484]: [1, 0, 0, 40, 39, 38, 37, 60, 59, 58, 0, 0, 16, 15, 14, 20, 19, 18]
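Then assign the list back (this lines up row-for-row because the frame is already sorted by ID):

df['expected'] = lst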
Upvotes: 0
Reputation: 13998
I believe @Tarun Lalwani has pointed you in one right direction, which is to save some critical state outside the DataFrame. The code can be simplified, though, and there is nothing wrong with using global variables as long as you manage names properly. It's one of the design patterns that can often make things simpler and improve readability.
cached_last = {'expected': None, 'ID': None}

def set_expected(x):
    # First row, or the ID changed: restart from the current value
    if cached_last['ID'] is None or x.ID != cached_last['ID']:
        expected = x.current
    else:
        expected = max(cached_last['expected'] - 1, x.current)
    cached_last['ID'] = x.ID
    cached_last['expected'] = expected
    return expected

df['expected'] = df.apply(set_expected, axis=1)
From the documentation on pandas.DataFrame.apply, do be careful about the potential side effects of the applied function:
In the current implementation apply calls func twice on the first column/row to decide whether it can take a fast or slow code path. This can lead to unexpected behavior if func has side-effects, as they will take effect twice for the first column/row.
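One way to guard against that, if you rerun the apply or reuse it on another frame, is a small (hypothetical) helper that resets the cache first:

def reset_cache():
    # Clear the module-level state so a rerun starts from a clean slate
    cached_last['ID'] = None
    cached_last['expected'] = None

reset_cache()
df['expected'] = df.apply(set_expected, axis=1)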
Upvotes: 0
Reputation: 1131
EDIT: To address OP's concern about scaling up to millions of rows.
Yes, my original answer will not scale to very large dataframes. However, with minor edits, this easy-to-read solution will scale. The optimizations that follow take advantage of the JIT compiler in Numba. After importing Numba, I added the jit decorator and modified the function to operate on numpy arrays instead of pandas objects, because Numba is numpy-aware and cannot optimize pandas objects.
import numba
import numpy as np

@numba.jit
def expected(id_col, current_col):
    lexp = []
    lstID = 0
    expected = 0
    for i in range(len(id_col)):
        id, current = id_col[i], current_col[i]
        if id == lstID:
            # Same ID: decrement (floored at 0), but never below current
            expected = max(current, max(expected - 1, 0))
        else:
            expected = current
        lexp.append(expected)
        lstID = id
    return np.array(lexp)
To pass a numpy array to the function, use the .values attribute of the pandas Series.
df1['expected'] = expected(df1.ID.values, df1.current.values)
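On newer pandas versions (0.24+), Series.to_numpy() is the equivalent accessor:

df1['expected'] = expected(df1['ID'].to_numpy(), df1['current'].to_numpy())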
To test the performance, I scaled up your original dataframe to more than 1 million rows.
df1 = df
while len(df1) < 1000000:
    df1 = pd.concat([df1, df1])
df1.reset_index(inplace=True, drop=True)
The new changes perform very well.
%timeit expected(df1.ID.values, df1.current.values)
44.9 ms ± 249 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
df1.shape
Out[65]: (1179648, 4)
df1.tail(15)
Out[66]:
ID Date current expected
1179633 2001980 10/27/2017 40 40
1179634 2001980 10/26/2017 39 39
1179635 2001980 10/25/2017 0 38
1179636 2001980 10/24/2017 0 37
1179637 2001980 10/23/2017 60 60
1179638 2001980 10/22/2017 0 59
1179639 2001980 10/21/2017 0 58
1179640 2002222 10/21/2017 0 0
1179641 2002222 10/20/2017 0 0
1179642 2002222 10/19/2017 16 16
1179643 2002222 10/18/2017 0 15
1179644 2002222 10/17/2017 0 14
1179645 2002222 10/16/2017 20 20
1179646 2002222 10/15/2017 19 19
1179647 2002222 10/14/2017 18 18
ORIGINAL ANSWER
A little brute force, but really easy to follow.
def expected(df):
    lexp = []
    lstID = None
    expected = 0
    for i in range(len(df)):
        id, current = df[['ID', 'current']].iloc[i]
        if id == lstID:
            expected = max(expected - 1, 0)
            expected = max(current, expected)
        else:
            expected = current
        lexp.append(expected)
        lstID = id
    return pd.Series(lexp)
Output
df['expected'] = expected(df)
df
Out[53]:
ID Date current expected
0 2001980 10/30/2017 1 1
1 2001980 10/29/2017 0 0
2 2001980 10/28/2017 0 0
3 2001980 10/27/2017 40 40
4 2001980 10/26/2017 39 39
5 2001980 10/25/2017 0 38
6 2001980 10/24/2017 0 37
7 2001980 10/23/2017 60 60
8 2001980 10/22/2017 0 59
9 2001980 10/21/2017 0 58
10 2002222 10/21/2017 0 0
11 2002222 10/20/2017 0 0
12 2002222 10/19/2017 16 16
13 2002222 10/18/2017 0 15
14 2002222 10/17/2017 0 14
15 2002222 10/16/2017 20 20
16 2002222 10/15/2017 19 19
17 2002222 10/14/2017 18 18
Upvotes: 0
Reputation: 146510
So you can do this using apply and nested functions.
import pandas as pd
ID = [2001980,2001980,2001980,2001980,2001980,2001980,2001980,2001980,2001980,2001980,2002222,2002222,2002222,2002222,2002222,2002222,2002222,2002222,]
Date = ["10/30/2017","10/29/2017","10/28/2017","10/27/2017","10/26/2017","10/25/2017","10/24/2017","10/23/2017","10/22/2017","10/21/2017","10/21/2017","10/20/2017","10/19/2017","10/18/2017","10/17/2017","10/16/2017","10/15/2017","10/14/2017",]
current = [1 ,0 ,0 ,40,39,0 ,0 ,60,0 ,0 ,0 ,0 ,16,0 ,0 ,20,19,18,]
df = pd.DataFrame({"ID": ID, "Date": Date, "current": current})
Then create the function to update the frame:
Python 3.X
def update_frame(df):
    last_expected = None

    def apply_logic(row):
        nonlocal last_expected
        last_row_id = row.name - 1
        if row.name == 0:
            last_expected = row["current"]
            return last_expected
        last_row = df.iloc[[last_row_id]].iloc[0].to_dict()
        last_expected = max(last_expected - 1, row['current']) if last_row['ID'] == row['ID'] else row['current']
        return last_expected

    return apply_logic
Python 2.X
def update_frame(df):
    sd = {"last_expected": None}

    def apply_logic(row):
        last_row_id = row.name - 1
        if row.name == 0:
            sd['last_expected'] = row["current"]
            return sd['last_expected']
        last_row = df.iloc[[last_row_id]].iloc[0].to_dict()
        sd['last_expected'] = max(sd['last_expected'] - 1, row['current']) if last_row['ID'] == row['ID'] else row['current']
        return sd['last_expected']

    return apply_logic
And run the function like below:
df['expected'] = df.apply(update_frame(df), axis=1)
The output is as expected.
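If you want to double-check against the target column from the question:

target = [1, 0, 0, 40, 39, 38, 37, 60, 59, 58, 0, 0, 16, 15, 14, 20, 19, 18]
print(df['expected'].tolist() == target)  # prints True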
Upvotes: 3
Reputation: 169284
You can use a conditional statement combined with .shift() to get the previous row, and np.where, which (AFAIK) does not rely on the explicit loops mentioned in a comment as something to avoid:
df['test'] = np.where(df['current'].shift() < df['current'],
                      df['current'] - 1, df['current'])
Result (I added a 'test' column with the result; you can change it to 'expected' if you so desire):
>>> df
         ID        Date  current  expected  test
0   2001980  10/30/2017        1         1     1
1   2001980  10/29/2017        0         0     0
2   2001980  10/28/2017        0         0     0
3   2001980  10/27/2017       40        40    39
4   2001980  10/26/2017       39        39    39
5   2001980  10/25/2017       38        38    38
6   2001980  10/24/2017       37        37    37
7   2001980  10/18/2017        0        36     0
8   2001980  10/17/2017        0        35     0
9   2001980  10/16/2017       60        60    59
10  2001980  10/15/2017        0        59     0
11  2001980  10/14/2017        0        58     0
12  2001980  10/13/2017        0        57     0
13  2001980  10/12/2017        0        56     0
14  2002222  10/21/2017        0         0     0
15  2002222  10/20/2017        0         0     0
16  2002222  10/19/2017       16        16    15
17  2002222  10/18/2017        0        15     0
18  2002222  10/17/2017        0        14     0
19  2002222  10/16/2017       13        13    12
20  2002222  10/15/2017       12        12    12
21  2002222  10/14/2017       11        11    11
22  2002222  10/13/2017       10        10    10
23  2002222  10/12/2017        9         9     9
Upvotes: 0