qqqwww

Reputation: 531

python - use previous row's value to update the new row's values

This is the current dataframe:

> ID        Date    current
> 2001980   10/30/2017  1   
> 2001980   10/29/2017  0   
> 2001980   10/28/2017  0   
> 2001980   10/27/2017  40  
> 2001980   10/26/2017  39  
> 2001980   10/25/2017  0   
> 2001980   10/24/2017  0   
> 2001980   10/23/2017  60  
> 2001980   10/22/2017  0   
> 2001980   10/21/2017  0   
> 2002222   10/21/2017  0   
> 2002222   10/20/2017  0   
> 2002222   10/19/2017  16  
> 2002222   10/18/2017  0   
> 2002222   10/17/2017  0   
> 2002222   10/16/2017  20  
> 2002222   10/15/2017  19  
> 2002222   10/14/2017  18  

Below is the final data frame; the expected column is what I am trying to get.

  1. One ID might have multiple dates/records/rows; (ID + Date) is unique.
  2. This row's expected value = last row's expected value - 1.
  3. The minimum value is 0.
  4. If the value from rule 2 is less than this row's current value, use this row's current value instead. For example, for ID 2001980 on 10/23/2017: rule 2 gives 36, but since 36 < 60 (the current value), we use 60.

Thank you so much.

> ID        Date    current expected 
> 2001980   10/30/2017  1   1 
> 2001980   10/29/2017  0   0
> 2001980   10/28/2017  0   0 
> 2001980   10/27/2017  40  40
> 2001980   10/26/2017  39  39 
> 2001980   10/25/2017  0   38
> 2001980   10/24/2017  0   37 
> 2001980   10/23/2017  60  60
> 2001980   10/22/2017  0   59 
> 2001980   10/21/2017  0   58
> 2002222   10/21/2017  0   0
> 2002222   10/20/2017  0   0 
> 2002222   10/19/2017  16  16
> 2002222   10/18/2017  0   15 
> 2002222   10/17/2017  0   14
> 2002222   10/16/2017  20  20
> 2002222   10/15/2017  19  19
> 2002222   10/14/2017  18  18

I am using Excel with the formula below:

= if(this row's ID = last row's ID, max(last row's expected value - 1, this row's current value), this row's current value)
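In plain Python terms, the Excel logic I am trying to replicate is roughly the following (a sketch over the sample rows above; column names taken from the tables):

```python
import pandas as pd

# Sample frame from the tables above
df = pd.DataFrame({
    "ID": [2001980] * 10 + [2002222] * 8,
    "current": [1, 0, 0, 40, 39, 0, 0, 60, 0, 0,
                0, 0, 16, 0, 0, 20, 19, 18],
})

# Walk the rows top to bottom, carrying the previous row's ID and expected value
expected = []
prev_id, prev_exp = None, 0
for row in df.itertuples(index=False):
    # same ID: count down from the previous expected, but never below current
    exp = max(prev_exp - 1, row.current) if row.ID == prev_id else row.current
    expected.append(exp)
    prev_id, prev_exp = row.ID, exp
df["expected"] = expected
```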

Upvotes: 6

Views: 9245

Answers (6)

Scott Boston

Reputation: 153460

Revised, simpler version:

df['expected'] = df.groupby(['ID',df.current.ne(0).cumsum()])['current']\
  .transform(lambda x: x.eq(0).cumsum().mul(-1).add(x.iloc[0])).clip(0,np.inf)
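The snippet above assumes numpy has been imported as np and that df holds the sample frame from the question; for reference, a self-contained run:

```python
import numpy as np
import pandas as pd

# Sample frame from the question
df = pd.DataFrame({
    "ID": [2001980] * 10 + [2002222] * 8,
    "current": [1, 0, 0, 40, 39, 0, 0, 60, 0, 0,
                0, 0, 16, 0, 0, 20, 19, 18],
})

# Group by ID and by the runs that start at each non-zero value, then
# count down from the first value of each run; clip negatives to zero.
df['expected'] = df.groupby(['ID', df.current.ne(0).cumsum()])['current'] \
    .transform(lambda x: x.eq(0).cumsum().mul(-1).add(x.iloc[0])).clip(0, np.inf)
```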

Let's have a little fun:

df['expected'] = (df.groupby('ID')['current'].transform(lambda x: x.where(x.ne(0)).ffill()) +
df.groupby(['ID',df.current.ne(0).cumsum()])['current'].transform(lambda x: x.eq(0).cumsum()).mul(-1))\
.clip(0,np.inf).fillna(0).astype(int)
print(df)

Output:

         ID        Date  current  expected
0   2001980  10/30/2017        1         1
1   2001980  10/29/2017        0         0
2   2001980  10/28/2017        0         0
3   2001980  10/27/2017       40        40
4   2001980  10/26/2017       39        39
5   2001980  10/25/2017        0        38
6   2001980  10/24/2017        0        37
7   2001980  10/23/2017       60        60
8   2001980  10/22/2017        0        59
9   2001980  10/21/2017        0        58
10  2002222  10/21/2017        0         0
11  2002222  10/20/2017        0         0
12  2002222  10/19/2017       16        16
13  2002222  10/18/2017        0        15
14  2002222  10/17/2017        0        14
15  2002222  10/16/2017       20        20
16  2002222  10/15/2017       19        19
17  2002222  10/14/2017       18        18

Details

Basically: create two series, s1 and s2, add them together, clip negative values to zero, and fill NaNs with zero.

# First series: fill the zeros within each 'ID' with the previous non-zero value
s1 = df.groupby('ID')['current'].transform(lambda x: x.where(x.ne(0)).ffill())
s1

Output:

0      1.0
1      1.0
2      1.0
3     40.0
4     39.0
5     39.0
6     39.0
7     60.0
8     60.0
9     60.0
10     NaN
11     NaN
12    16.0
13    16.0
14    16.0
15    20.0
16    19.0
17    18.0
Name: current, dtype: float64

# Second series: the negated cumulative count of zeros within each run, grouped by 'ID'
s2 = df.groupby(['ID',df.current.ne(0).cumsum()])['current'].transform(lambda x: x.eq(0).cumsum()).mul(-1)
s2

Output:

0     0
1    -1
2    -2
3     0
4     0
5    -1
6    -2
7     0
8    -1
9    -2
10   -1
11   -2
12    0
13   -1
14   -2
15    0
16    0
17    0
Name: current, dtype: int32

Add the series together, clip, and fill NaNs:

(s1 + s2).clip(0, np.inf).fillna(0)

Output:

0      1.0
1      0.0
2      0.0
3     40.0
4     39.0
5     38.0
6     37.0
7     60.0
8     59.0
9     58.0
10     0.0
11     0.0
12    16.0
13    15.0
14    14.0
15    20.0
16    19.0
17    18.0
Name: current, dtype: float64

Upvotes: 4

BENY

Reputation: 323226

The logic here should work:

lst = []

for _, y in df.groupby('ID'):
    z = []
    for i, (_, x) in enumerate(y.iterrows()):
        if x['current'] > 0:
            z.append(x['current'])
        elif i == 0:
            # first row of this ID, nothing to count down from
            z.append(0)
        else:
            z.append(max(z[i - 1] - 1, 0))
    lst.extend(z)

lst

Out[484]: [1, 0, 0, 40, 39, 38, 37, 60, 59, 58, 0, 0, 16, 15, 14, 20, 19, 18]
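One caveat when assigning lst back to the frame: groupby visits groups in sorted key order, which matches the original row order here only because the IDs already appear sorted and in contiguous blocks. A sketch that keys each result by its original index instead, so the write-back is order-safe:

```python
import pandas as pd

# Sample frame from the question
df = pd.DataFrame({
    "ID": [2001980] * 10 + [2002222] * 8,
    "current": [1, 0, 0, 40, 39, 0, 0, 60, 0, 0,
                0, 0, 16, 0, 0, 20, 19, 18],
})

# Key each value by the original row index so the assignment back into
# df is correct regardless of the order groupby visits the groups.
out = {}
for _, grp in df.groupby('ID'):
    prev = 0
    for idx, row in grp.iterrows():
        prev = row['current'] if row['current'] > 0 else max(prev - 1, 0)
        out[idx] = prev
df['expected'] = pd.Series(out)
```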

Upvotes: 0

jxc

Reputation: 13998

I believe @Tarun Lalwani has pointed you in the right direction: save some critical information outside the DataFrame. The code can be simplified, though, and there is nothing wrong with using global variables as long as you manage names properly; it is one of the design patterns that can often make things simpler and improve readability.

cached_last = { 'expected': None, 'ID': None }

def set_expected(x):
    if cached_last['ID'] is None or x.ID != cached_last['ID']:
        expected = x.current
    else:
        expected =  max(cached_last['expected'] - 1, x.current)
    cached_last['ID'] = x.ID
    cached_last['expected'] = expected
    return expected

df['expected'] = df.apply(set_expected, axis=1)

From the documentation on pandas.DataFrame.apply, do be careful about the potential side effects of the apply function:

In the current implementation apply calls func twice on the first column/row to decide whether it can take a fast or slow code path. This can lead to unexpected behavior if func has side-effects, as they will take effect twice for the first column/row.
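A practical consequence of that caveat: the module-level cache makes the apply call stateful, so re-running it requires a reset first. A minimal self-contained sketch (reusing the names from the code above, on a tiny made-up frame):

```python
import pandas as pd

# Tiny made-up frame: two rows of one ID, then a new ID
df = pd.DataFrame({"ID": [1, 1, 2], "current": [5, 0, 0]})

cached_last = {'expected': None, 'ID': None}

def set_expected(x):
    if cached_last['ID'] is None or x.ID != cached_last['ID']:
        expected = x.current
    else:
        expected = max(cached_last['expected'] - 1, x.current)
    cached_last['ID'] = x.ID
    cached_last['expected'] = expected
    return expected

# Reset the cache before each run: a second apply on the same frame
# would otherwise start from stale state and compare row 0 against it.
cached_last.update(ID=None, expected=None)
df['expected'] = df.apply(set_expected, axis=1)
```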

Upvotes: 0

floydn

Reputation: 1131

EDIT: To address the OP's concern about scaling up to millions of rows.

Yes, my original answer will not scale to very large dataframes. However, with minor edits, this easy-to-read solution will scale. The optimizations that follow take advantage of the JIT compiler in Numba. After importing Numba, I add the jit decorator and modify the function to operate on numpy arrays instead of pandas objects; Numba is numpy-aware and cannot optimize pandas objects.

import numba
import numpy as np

@numba.jit
def expected(id_col, current_col):
    lexp = []
    lstID = 0
    expected = 0
    for i in range(len(id_col)):
        id, current = id_col[i], current_col[i]
        if id == lstID:
            expected = max(current, max(expected - 1, 0))
        else:
            expected = current
        lexp.append(expected)
        lstID = id
    return np.array(lexp)

To pass a numpy array to the function, use the .values attribute of the pandas series.

df1['expected'] = expected(df1.ID.values, df1.current.values)

To test the performance, I scaled up your original dataframe to more than 1 million rows.

df1 = df
while len(df1) < 1000000:
    df1 = pd.concat([df1, df1])
df1.reset_index(inplace=True, drop=True)

The new changes perform very well.

%timeit expected(df1.ID.values, df1.current.values)
44.9 ms ± 249 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

df1.shape
Out[65]: (1179648, 4)

df1.tail(15)
Out[66]: 
              ID        Date  current  expected
1179633  2001980  10/27/2017       40        40
1179634  2001980  10/26/2017       39        39
1179635  2001980  10/25/2017        0        38
1179636  2001980  10/24/2017        0        37
1179637  2001980  10/23/2017       60        60
1179638  2001980  10/22/2017        0        59
1179639  2001980  10/21/2017        0        58
1179640  2002222  10/21/2017        0         0
1179641  2002222  10/20/2017        0         0
1179642  2002222  10/19/2017       16        16
1179643  2002222  10/18/2017        0        15
1179644  2002222  10/17/2017        0        14
1179645  2002222  10/16/2017       20        20
1179646  2002222  10/15/2017       19        19
1179647  2002222  10/14/2017       18        18

ORIGINAL ANSWER

A little brute force but really easy to follow.

def expected(df):
    lexp = []
    lstID = None
    expected = 0
    for i in range(len(df)):
        id, current = df[['ID', 'current']].iloc[i]
        if id == lstID:
            expected = max(expected - 1, 0)
            expected = max(current, expected)
        else:
            expected = current
        lexp.append(expected)
        lstID = id
    return pd.Series(lexp)

Output

df['expected'] = expected(df)

df
Out[53]: 
         ID        Date  current  expected
0   2001980  10/30/2017        1         1
1   2001980  10/29/2017        0         0
2   2001980  10/28/2017        0         0
3   2001980  10/27/2017       40        40
4   2001980  10/26/2017       39        39
5   2001980  10/25/2017        0        38
6   2001980  10/24/2017        0        37
7   2001980  10/23/2017       60        60
8   2001980  10/22/2017        0        59
9   2001980  10/21/2017        0        58
10  2002222  10/21/2017        0         0
11  2002222  10/20/2017        0         0
12  2002222  10/19/2017       16        16
13  2002222  10/18/2017        0        15
14  2002222  10/17/2017        0        14
15  2002222  10/16/2017       20        20
16  2002222  10/15/2017       19        19
17  2002222  10/14/2017       18        18

Upvotes: 0

Tarun Lalwani

Reputation: 146510

So you can do this using apply and nested functions.

import pandas as pd
ID = [2001980,2001980,2001980,2001980,2001980,2001980,2001980,2001980,2001980,2001980,2002222,2002222,2002222,2002222,2002222,2002222,2002222,2002222,]
Date = ["10/30/2017","10/29/2017","10/28/2017","10/27/2017","10/26/2017","10/25/2017","10/24/2017","10/23/2017","10/22/2017","10/21/2017","10/21/2017","10/20/2017","10/19/2017","10/18/2017","10/17/2017","10/16/2017","10/15/2017","10/14/2017",]
current = [1 ,0 ,0 ,40,39,0 ,0 ,60,0 ,0 ,0 ,0 ,16,0 ,0 ,20,19,18,]

df = pd.DataFrame({"ID": ID, "Date": Date, "current": current})

Then create the function to update the frame

Python 3.X

def update_frame(df):
    last_expected = None
    def apply_logic(row):
        nonlocal last_expected
        last_row_id = row.name - 1
        if row.name == 0:
            last_expected = row["current"]
            return last_expected
        last_row = df.iloc[[last_row_id]].iloc[0].to_dict()
        last_expected = max(last_expected-1,row['current']) if last_row['ID'] == row['ID'] else row['current']
        return last_expected
    return apply_logic

Python 2.X

def update_frame(df):
    sd = {"last_expected": None}
    def apply_logic(row):
        last_row_id = row.name - 1
        if row.name == 0:
            sd['last_expected'] = row["current"]
            return sd['last_expected']
        last_row = df.iloc[[last_row_id]].iloc[0].to_dict()
        sd['last_expected'] = max(sd['last_expected'] - 1,row['current']) if last_row['ID'] == row['ID'] else row['current']
        return sd['last_expected']
    return apply_logic

And run the function like below

df['expected'] = df.apply(update_frame(df), axis=1)

The output matches the expected column.

Upvotes: 3

mechanical_meat

Reputation: 169284

You can use a conditional statement combined with .shift() to get the previous row, plus np.where, which AFAIK does not rely on loops (mentioned in a comment as something to avoid):

df['test'] = np.where(df['current'].shift() < 
                      df['current'], df['current'] - 1, df['current'])

Result (I added a 'test' column with the result; you can change it to 'expected' if you so desire):

>>> df
         ID        Date  current  expected  test
0   2001980  10/30/2017        1         1     1
1   2001980  10/29/2017        0         0     0
2   2001980  10/28/2017        0         0     0
3   2001980  10/27/2017       40        40    39
4   2001980  10/26/2017       39        39    39
5   2001980  10/25/2017       38        38    38
6   2001980  10/24/2017       37        37    37
7   2001980  10/18/2017        0        36     0
8   2001980  10/17/2017        0        35     0
9   2001980  10/16/2017       60        60    59
10  2001980  10/15/2017        0        59     0
11  2001980  10/14/2017        0        58     0
12  2001980  10/13/2017        0        57     0
13  2001980  10/12/2017        0        56     0
14  2002222  10/21/2017        0         0     0
15  2002222  10/20/2017        0         0     0
16  2002222  10/19/2017       16        16    15
17  2002222  10/18/2017        0        15     0
18  2002222  10/17/2017        0        14     0
19  2002222  10/16/2017       13        13    12
20  2002222  10/15/2017       12        12    12
21  2002222  10/14/2017       11        11    11
22  2002222  10/13/2017       10        10    10
23  2002222  10/12/2017        9         9     9

Upvotes: 0
