How to speed up this Python function?

Question

I hope it is OK to ask questions of this type.

I have a get_lags function that takes a data frame, and for each column, shifts the column by each n in the list n_lags. So, if n_lags = [1, 2], the function shifts each column once by 1 and once by 2 positions, creating new lagged columns in this way.

def get_lags (df, n_lags):
    data =df.copy()
    data_with_lags = pd.DataFrame()
    for column in data.columns:
        for i in range(n_lags[0], n_lags[-1]+1):
            new_column_name = str(column) + '_Lag' + str(i)
            data_with_lags[new_column_name] = data[column].shift(-i) 
    data_with_lags.fillna(method = 'ffill', limit = max(n_lags), inplace = True)
    return data_with_lags

So, if:

df.columns
ColumnA
ColumnB

Then, get_lags(df, [1 , 2]).columns will be:

ColumnA_Lag1
ColumnA_Lag2
ColumnB_Lag1
ColumnB_Lag2

Issue: working with data frames that have about 100,000 rows and 20,000 columns, this takes forever to run. On a 16-GB RAM, core i7 windows machine, once I waited for 15 minutes to the code to run before I stopped it. Is there anyway I can tweak this function to make it faster?

cs95 · Accepted Answer

You'll need shift + concat. Here's the concise version -

def get_lags(df, n_lags):
    return pd.concat(
       [df] + [df.shift(i).add_suffix('_Lag{}'.format(i)) for i in n_lags],
       axis=1
    )

And here's a more memory-friendly version, using a for loop -

def get_lags(df, n_lags):
    df_list = [df]
    for i in n_lags:
        v = df.shift(i)
        v.columns = v.columns + '_Lag{}'.format(i)    
        df_list.append(v)

    return pd.concat(df_list, axis=1)

How to speed up this Python function?

Answers (2)

Related Questions