Reputation: 2089
I hope it is OK to ask questions of this type.
I have a get_lags function that takes a data frame and, for each column, shifts the column by each n in the list n_lags. So, if n_lags = [1, 2], the function shifts each column once by 1 and once by 2 positions, creating new lagged columns in this way.
def get_lags(df, n_lags):
    data = df.copy()
    data_with_lags = pd.DataFrame()
    for column in data.columns:
        for i in range(n_lags[0], n_lags[-1] + 1):
            new_column_name = str(column) + '_Lag' + str(i)
            data_with_lags[new_column_name] = data[column].shift(-i)
    data_with_lags.fillna(method='ffill', limit=max(n_lags), inplace=True)
    return data_with_lags
So, if:
df.columns
ColumnA
ColumnB
Then, get_lags(df, [1, 2]).columns will be:
ColumnA_Lag1
ColumnA_Lag2
ColumnB_Lag1
ColumnB_Lag2
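To make the lag semantics concrete, here is a minimal standalone illustration (the toy series is invented for this example, not taken from the question) of the shift(-i) call that get_lags uses:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], name="ColumnA")
# shift(-1) moves values up one row (a "lead"), leaving NaN at the end
print(s.shift(-1).tolist())  # [20.0, 30.0, 40.0, nan]
```

Note that shift(-i) pulls future rows backwards; a positive shift(i) would instead push past rows forwards, which is the more common meaning of "lag".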
Issue: working with data frames that have about 100,000 rows and 20,000 columns, this takes forever to run. On a Windows machine with a Core i7 and 16 GB of RAM, I once waited 15 minutes for the code to finish before stopping it. Is there any way I can tweak this function to make it faster?
Upvotes: 1
Views: 76
Reputation: 402253
You'll need shift + concat. Here's the concise version -
def get_lags(df, n_lags):
    return pd.concat(
        [df] + [df.shift(i).add_suffix('_Lag{}'.format(i)) for i in n_lags],
        axis=1
    )
And here's a more memory-friendly version, using a for loop -
def get_lags(df, n_lags):
    df_list = [df]
    for i in n_lags:
        v = df.shift(i)
        v.columns = v.columns + '_Lag{}'.format(i)
        df_list.append(v)
    return pd.concat(df_list, axis=1)
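As a quick sanity check (on a toy frame made up for illustration, not the asker's data), the concise version keeps the original columns and appends the lagged ones. One caveat: it uses shift(i) rather than the question's shift(-i), so the lags look backwards in time rather than forwards:

```python
import pandas as pd

def get_lags(df, n_lags):
    return pd.concat(
        [df] + [df.shift(i).add_suffix('_Lag{}'.format(i)) for i in n_lags],
        axis=1
    )

df = pd.DataFrame({'ColumnA': [1, 2, 3], 'ColumnB': [4, 5, 6]})
out = get_lags(df, [1, 2])
print(list(out.columns))
# ['ColumnA', 'ColumnB', 'ColumnA_Lag1', 'ColumnB_Lag1', 'ColumnA_Lag2', 'ColumnB_Lag2']
```

Because each shifted frame is built once and concatenated in a single pass, this avoids the column-by-column assignment loop that makes the original slow on wide frames.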
Upvotes: 1
Reputation: 34145
This may not apply to your case (I hope I understand what you're trying to do correctly), but you can speed it up massively by not doing the shifting in the first place. Can you treat your columns like a ring buffer?
Instead of rearranging the columns afterwards, keep track of the index of the current column and the total number of columns. Then, instead of moving the data, you do something like:
current_column = (current_column + 1) % total_columns
and write to that column next.
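A hypothetical sketch of that idea with NumPy (the buffer size, column count, and update loop are all invented for illustration): each new snapshot overwrites the oldest column in place, so no data is ever shifted or copied:

```python
import numpy as np

n_rows, total_columns = 4, 3           # keep only the 3 most recent snapshots
buf = np.zeros((n_rows, total_columns))
current_column = 0                     # index of the next column to overwrite

for step in range(5):
    new_data = np.full(n_rows, step)   # stand-in for a freshly computed column
    buf[:, current_column] = new_data  # overwrite the oldest column in place
    current_column = (current_column + 1) % total_columns

print(buf[0])  # [3. 4. 2.] - the 3 most recent values, in ring order
```

The trade-off is that consumers of the buffer must translate between ring position and logical lag, but writes become O(rows) per step regardless of how many lags you keep.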
Upvotes: 1