Reputation: 33
I have the following data frame:
id date PRC 52wh
0 1 01/02/2010 1.0 1.0
1 1 01/03/2010 2.0 2.0
2 1 01/04/2010 10.0 10.0
3 1 01/05/2010 2.0 10.0
4 1 01/06/2010 0.5 10.0
5 1 01/07/2010 3.0 10.0
6 1 01/08/2010 5.0 10.0
7 1 01/09/2010 5.0 10.0
8 1 01/10/2010 7.0 10.0
9 1 01/11/2010 1.0 10.0
10 1 01/12/2010 3.0 10.0
11 2 01/02/2010 4.0 4.0
12 2 01/03/2010 5.0 5.0
13 2 01/04/2010 1.0 5.0
14 2 01/05/2010 2.0 5.0
15 2 01/06/2010 3.0 5.0
16 2 01/07/2010 4.0 3.0
What I need is a new column that, for each row, holds the date on which the current 52wh value was set, i.e. the most recent date on which PRC was equal to the current row's 52wh. So whenever the 52wh in the current row is higher than the PRC in the current row, the program should look backwards to find the row where PRC equalled this 52wh (a naive sketch of this rule follows the desired output below). The desired output is the following:
id date PRC 52wh date_y
2 1 01/02/2010 1.0 1.0 01/02/2010
5 1 01/03/2010 2.0 2.0 01/03/2010
6 1 01/04/2010 10.0 10.0 01/04/2010
7 1 01/05/2010 2.0 10.0 01/04/2010
8 1 01/06/2010 0.5 10.0 01/04/2010
9 1 01/07/2010 3.0 10.0 01/04/2010
10 1 01/08/2010 5.0 10.0 01/04/2010
11 1 01/09/2010 5.0 10.0 01/04/2010
12 1 01/10/2010 7.0 10.0 01/04/2010
13 1 01/11/2010 1.0 10.0 01/04/2010
14 1 01/12/2010 3.0 10.0 01/04/2010
16 2 01/02/2010 4.0 4.0 01/02/2010
17 2 01/03/2010 5.0 5.0 01/03/2010
20 2 01/04/2010 1.0 5.0 01/03/2010
23 2 01/05/2010 2.0 5.0 01/03/2010
26 2 01/06/2010 3.0 5.0 01/03/2010
29 2 01/07/2010 4.0 3.0 01/06/2010
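To pin the rule down, here is a deliberately naive sketch of the lookup (far too slow for real data, shown only to make the logic explicit; it uses the sample frame df_v1 built at the end of the question, and lookup_date is just an illustrative helper):
# naive reference implementation of the rule, not a solution
def lookup_date(df, i):
    # walk backwards from row i until a row's PRC equals row i's 52wh
    target = df['52wh'].iloc[i]
    for j in range(i, -1, -1):
        if df['PRC'].iloc[j] == target:
            return df['date'].iloc[j]
    return None

print([lookup_date(df_v1, i) for i in range(len(df_v1))])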
For this, I've already implemented the following code, which does the job:
df = df_v1
df['uid'] = df.index  # running row number, used to keep only matches that are not in the future
df_desc = df.sort_values('uid', ascending=False)
# match each row's 52wh against every PRC, pulling in the date and uid of the matching row
result = df.merge(df_desc[['PRC','date','uid']], left_on="52wh", right_on="PRC", how="left", suffixes=['', '_y']).drop(columns=['PRC_y'])
# drop matches from later rows, then keep the most recent remaining match per row
highest_uid = result[result.uid >= result.uid_y]
final_result = highest_uid.groupby('uid').head(1)
The problem is that I need to run this on a large dataset, a CSV file of at least 250 MB, and this code is too memory intensive: I get a memory error and nothing I've tried works. Splitting the file into smaller samples is not an option, and changing the dtypes didn't help either.
Can someone help me optimize this to run on a large file with over 5 million rows? Ideally it would run on one with over 30 billion rows :) or at least several billion. I only have 16 GB of RAM.
The code to make the small data frame to work with is:
import pandas as pd
import numpy as np
df_v1 = pd.DataFrame(
    data=np.array([[1, '01/02/2010'], [1, '01/03/2010'], [1, '01/04/2010'], [1, '01/05/2010'],
                   [1, '01/06/2010'], [1, '01/07/2010'], [1, '01/08/2010'], [1, '01/09/2010'],
                   [1, '01/10/2010'], [1, '01/11/2010'], [1, '01/12/2010'], [2, '01/02/2010'],
                   [2, '01/03/2010'], [2, '01/04/2010'], [2, '01/05/2010'], [2, '01/06/2010'],
                   [2, '01/07/2010']]),
    columns=['id', 'date'])
df_v1['PRC']=1,2,10,2,0.5,3,5,5,7,1,3,4,5,1,2,3,4
df_v1['52wh']=1,2,10,10,10,10,10,10,10,10,10,4,5,5,5,5,3
df_v1 = df_v1.astype({'PRC': 'float64', '52wh': 'float64'})
Really hope someone can help me out on this one!! Thanks in advance guys!
Upvotes: 2
Views: 72
Reputation: 181
Running this code on a df with 10 million rows took a few minutes.
prc_date_dict = dict()

def a(row):
    # update the date value for the most recent PRC
    prc_date_dict[row['PRC']] = row['date']
    # get the date of the PRC that equals this row's 52wh
    return prc_date_dict[row['52wh']]

df['date_y'] = df.apply(a, axis=1)
edit: removed redundant if, maybe this will run faster
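For reference, a minimal way to try this on the sample frame from the question (assuming df_v1 has been built as shown there and the rows are already in chronological order per id):
# assumes df_v1 from the question and the function a / prc_date_dict defined above
df = df_v1.copy()
prc_date_dict.clear()                # reset the lookup dict between runs
df['date_y'] = df.apply(a, axis=1)
print(df[['id', 'date', 'PRC', '52wh', 'date_y']])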
Upvotes: 2