Saeed
Saeed

Reputation: 2089

How to speed up this Python function?

I hope it is OK to ask questions of this type.

I have a get_lags function that takes a data frame, and for each column, shifts the column by each n in the list n_lags. So, if n_lags = [1, 2], the function shifts each column once by 1 and once by 2 positions, creating new lagged columns in this way.

def get_lags (df, n_lags):
    data =df.copy()
    data_with_lags = pd.DataFrame()
    for column in data.columns:
        for i in range(n_lags[0], n_lags[-1]+1):
            new_column_name = str(column) + '_Lag' + str(i)
            data_with_lags[new_column_name] = data[column].shift(-i) 
    data_with_lags.fillna(method = 'ffill', limit = max(n_lags), inplace = True)
    return data_with_lags

So, if:

df.columns
ColumnA
ColumnB

Then, get_lags(df, [1 , 2]).columns will be:

ColumnA_Lag1
ColumnA_Lag2
ColumnB_Lag1
ColumnB_Lag2

Issue: working with data frames that have about 100,000 rows and 20,000 columns, this takes forever to run. On a 16-GB RAM, core i7 windows machine, once I waited for 15 minutes to the code to run before I stopped it. Is there anyway I can tweak this function to make it faster?

Upvotes: 1

Views: 76

Answers (2)

cs95
cs95

Reputation: 402253

You'll need shift + concat. Here's the concise version -

def get_lags(df, n_lags):
    return pd.concat(
       [df] + [df.shift(i).add_suffix('_Lag{}'.format(i)) for i in n_lags],
       axis=1
    )

And here's a more memory-friendly version, using a for loop -

def get_lags(df, n_lags):
    df_list = [df]
    for i in n_lags:
        v = df.shift(i)
        v.columns = v.columns + '_Lag{}'.format(i)    
        df_list.append(v)

    return pd.concat(df_list, axis=1)

Upvotes: 1

viraptor
viraptor

Reputation: 34145

This may not apply to your case (I hope I understand what you're trying to do correctly), but you can speed it up massively by not doing it in the first place. Can you treat your columns like a ring buffer?

Instead of changing the columns afterwards, keep track of:

  • how many columns can you use (how many lag items for each entry)
  • what was the last lag column used
  • (optionally) how many times you "rotated"

So instead of moving the data, you do something like:

current_column = (current_column + 1) % total_columns

and write to that column next.

Upvotes: 1

Related Questions