Reputation: 285
I have a dataset. I want a window of 5 values. Does pandas have a native function that will give me a rolling window of 5 values until there are no longer 5 values that it can use? I want these to be rows. I also want the new label to be the middle of the 5 values.
Input DataFrame
first label
0 1 0
1 2 1
2 3 2
3 4 3
4 5 4
5 6 5
Output DataFrame desired:
first label
0 [1, 2, 3, 4, 5] 2
1 [2, 3, 4, 5, 6] 3
I have tried using the .rolling function and haven't been successful.
Upvotes: 0
Views: 886
Reputation: 28729
An alternative, more of a hack, I don't think pandas has a native function for what you want.
Convert dataframe to numpy, transpose dataframe and pull out labels and array, using a list comprehension:
M = df.to_numpy().T
outcome = [(M[0,i:5+i],
M[1][(5+i)//2])
for i in range(0,M.shape[1])
if 5+i <=M.shape[1]
]
print(outcome)
[(array([1, 2, 3, 4, 5]), 2), (array([2, 3, 4, 5, 6]), 3)]
pd.DataFrame(outcome,columns=['first','label'])
first label
0 [1, 2, 3, 4, 5] 2
1 [2, 3, 4, 5, 6] 3
Upvotes: 0
Reputation: 863701
You can use strides and for label
get position of middle value and by numpy indexing set value:
def rolling_window(a, window):
shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
strides = a.strides + (a.strides[-1],)
return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
a = rolling_window(df['first'].to_numpy(), 5)
print (a)
[[ 1 2 3 4 5]
[2 3 4 5 6]]
#get positions of middle value
i = rolling_window(np.arange(len(df)), 5)[:, 2]
print (i)
[2 3]
df = pd.DataFrame({'first':a.tolist(),
'label': df['label'].to_numpy()[i]})
print (df)
first label
0 [1, 2, 3, 4, 5] 2
1 [2, 3, 4, 5, 6] 3
You can more optimalize code for run strides only one:
def rolling_window(a, window):
shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
strides = a.strides + (a.strides[-1],)
return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
#get positions
idx = rolling_window(np.arange(len(df)), 5)
print (idx)
[[0 1 2 3 4]
[1 2 3 4 5]]
df = pd.DataFrame({'first': df['first'].to_numpy()[idx].tolist(),
'label': df['label'].to_numpy()[idx][:, 2]})
print (df)
first label
0 [1, 2, 3, 4, 5] 2
1 [2, 3, 4, 5, 6] 3
Upvotes: 1