freddy888
freddy888

Reputation: 1010

Pandas: Groupby Fillna Not working

I have the following dataframe, which has about 4000 Tickers and in Total about 2 million rows:

Ticker      Date              Rank         
  1         01/01/2000         5            
  1         01/02/2000        NaN             
  2         01/01/2000         4            
  2         01/02/2000         2            

I now run the following code to carryforward the Rank column, which works totally fine.:

import pandas as pd
df= df.sort_values(by=["Ticker", "Date"], ascending=[True,True])
df['Rank'] = df.groupby('Ticker')['Rank'].fillna(value=None, method="ffill")

However, I now want to carryforward a different column. In order to create this column I do the following:

  import numpy as np
  df["Code"]=np.NaN

In this function I write some code that about 200 values will be replaced by 1 according to the date and ticker values in the df "add". This code worksand looks the following:

df["Code"][(df.Date == add) & (df["Ticker"] == column)] = 1

This makes my dataframe look like this:

Ticker      Date              Rank          Code      
  1         01/01/2000         5             NaN
  1         01/02/2000        NaN            NaN
  2         01/01/2000         4              1
  2         01/02/2000         2             NaN

Now, I want to carryforward this column, but the code takes forever.

import pandas as pd
df= df.sort_values(by=["Ticker", "Date"], ascending=[True,True])
df['Code'] = df.groupby('Ticker')['Code'].fillna(value=None, method="ffill")

I have ran it for two days and my pc crashed. There must be some mistake here in the way I am doing things, because the above carryforward runs so fast and this one does not even finish. I checked the dtype of "Code" and it is float64.

Can anyone help?

Upvotes: 1

Views: 1801

Answers (1)

jezrael
jezrael

Reputation: 862771

I try a bit improve your code with sort=False to groupby and DataFrameGroupBy.ffill:

#convert column to datetime
df["Date"] = pd.to_datetime(df["Date"])
#ascending can be omit, because default value
df= df.sort_values(by=["Ticker", "Date"])

df['Rank'] = df.groupby('Ticker', sort=False)['Rank'].ffill()
add = '01/01/2000'
column = 2

df.loc[(df.Date == add) & (df["Ticker"] == column), "Code"] = 1

#sorting again is not necessary
df['Code'] = df.groupby('Ticker', sort=False)['Code'].ffill()
print (df)
   Ticker       Date  Rank  Code
0       1 2000-01-01   5.0   NaN
1       1 2000-01-02   5.0   NaN
2       2 2000-01-01   4.0   1.0
3       2 2000-01-02   2.0   1.0

Upvotes: 1

Related Questions