Reputation: 33
I have the following data frame:
id date PRC 52wh
0 1 01/02/2010 1.0 1.0
1 1 01/03/2010 2.0 2.0
2 1 01/04/2010 10.0 10.0
3 1 01/05/2010 2.0 10.0
4 1 01/06/2010 0.5 10.0
5 1 01/07/2010 3.0 10.0
6 1 01/08/2010 5.0 10.0
7 1 01/09/2010 5.0 10.0
8 1 01/10/2010 7.0 10.0
9 1 01/11/2010 1.0 10.0
10 1 01/12/2010 3.0 10.0
11 2 01/02/2010 4.0 4.0
12 2 01/03/2010 5.0 5.0
13 2 01/04/2010 1.0 5.0
14 2 01/05/2010 2.0 5.0
15 2 01/06/2010 3.0 5.0
16 2 01/07/2010 4.0 3.0
What I need is a new column that, for each row, holds the date on which the current 52wh value was set, i.e. the most recent date on which PRC was equal to the current row's 52wh. So whenever the 52wh in the current row is higher than the PRC in the current row, the program should look backwards to find the row where PRC equalled this 52wh (a naive sketch of this rule follows the desired output below). The desired output is the following:
id date PRC 52wh date_y
2 1 01/02/2010 1.0 1.0 01/02/2010
5 1 01/03/2010 2.0 2.0 01/03/2010
6 1 01/04/2010 10.0 10.0 01/04/2010
7 1 01/05/2010 2.0 10.0 01/04/2010
8 1 01/06/2010 0.5 10.0 01/04/2010
9 1 01/07/2010 3.0 10.0 01/04/2010
10 1 01/08/2010 5.0 10.0 01/04/2010
11 1 01/09/2010 5.0 10.0 01/04/2010
12 1 01/10/2010 7.0 10.0 01/04/2010
13 1 01/11/2010 1.0 10.0 01/04/2010
14 1 01/12/2010 3.0 10.0 01/04/2010
16 2 01/02/2010 4.0 4.0 01/02/2010
17 2 01/03/2010 5.0 5.0 01/03/2010
20 2 01/04/2010 1.0 5.0 01/03/2010
23 2 01/05/2010 2.0 5.0 01/03/2010
26 2 01/06/2010 3.0 5.0 01/03/2010
29 2 01/07/2010 4.0 3.0 01/06/2010
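To pin the rule down, here is a deliberately naive sketch of the lookup (far too slow for real data, shown only to make the logic explicit; it uses the sample frame df_v1 built at the end of the question, and lookup_date is just an illustrative helper):
# naive reference implementation of the rule, not a solution
def lookup_date(df, i):
    # walk backwards from row i until a row's PRC equals row i's 52wh
    target = df['52wh'].iloc[i]
    for j in range(i, -1, -1):
        if df['PRC'].iloc[j] == target:
            return df['date'].iloc[j]
    return None

print([lookup_date(df_v1, i) for i in range(len(df_v1))])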
For this, I've already implemented the following code, which does the job:
df = df_v1
df['uid'] = df.index  # running row number, used to keep only matches that are not in the future
df_desc = df.sort_values('uid', ascending=False)
# match each row's 52wh against every PRC, pulling in the date and uid of the matching row
result = df.merge(df_desc[['PRC','date','uid']], left_on="52wh", right_on="PRC", how="left", suffixes=['', '_y']).drop(columns=['PRC_y'])
# drop matches from later rows, then keep the most recent remaining match per row
highest_uid = result[result.uid >= result.uid_y]
final_result = highest_uid.groupby('uid').head(1)
The problem is that I need to run this on a large dataset, a CSV file of at least 250 MB, and this code is too memory intensive: I get a memory error and nothing I've tried works. Splitting the file into smaller samples is not an option, and changing the dtypes didn't help either.
Can someone help me optimize this to run on a large file with over 5 million rows? Ideally it would run on one with over 30 billion rows :) or at least several billion. I only have 16 GB of RAM.
The code to make the small data frame to work with is:
import pandas as pd
import numpy as np
df_v1 = pd.DataFrame(
    data=np.array([[1, '01/02/2010'], [1, '01/03/2010'], [1, '01/04/2010'], [1, '01/05/2010'],
                   [1, '01/06/2010'], [1, '01/07/2010'], [1, '01/08/2010'], [1, '01/09/2010'],
                   [1, '01/10/2010'], [1, '01/11/2010'], [1, '01/12/2010'], [2, '01/02/2010'],
                   [2, '01/03/2010'], [2, '01/04/2010'], [2, '01/05/2010'], [2, '01/06/2010'],
                   [2, '01/07/2010']]),
    columns=['id', 'date'])
df_v1['PRC']=1,2,10,2,0.5,3,5,5,7,1,3,4,5,1,2,3,4
df_v1['52wh']=1,2,10,10,10,10,10,10,10,10,10,4,5,5,5,5,3
df_v1 = df_v1.astype({'PRC': 'float64', '52wh': 'float64'})
Really hope someone can help me out on this one!! Thanks in advance guys!
Upvotes: 2
Views: 72
Reputation: 181
Running this code on a df with 10 million rows took a few minutes.
prc_date_dict = dict()

def a(row):
    # update the date value for the most recent PRC
    prc_date_dict[row['PRC']] = row['date']
    # get the date of the PRC that equals this row's 52wh
    return prc_date_dict[row['52wh']]

df['date_y'] = df.apply(a, axis=1)
edit: removed redundant if, maybe this will run faster
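For reference, a minimal way to try this on the sample frame from the question (assuming df_v1 has been built as shown there and the rows are already in chronological order per id):
# assumes df_v1 from the question and the function a / prc_date_dict defined above
df = df_v1.copy()
prc_date_dict.clear()                # reset the lookup dict between runs
df['date_y'] = df.apply(a, axis=1)
print(df[['id', 'date', 'PRC', '52wh', 'date_y']])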
Upvotes: 2