Tim
Tim

Reputation: 119

Looking for an efficient way to reorganize a pandas Dataframe

I have data in a Dataframe structured like so...

       D           K         Factor        p
0   -0.483128   -1.240024  -1.214765   -1.002418
1   -0.692334   -1.632132   1.562630    0.997304
2   -1.189383   -1.632132   1.562630    0.997304
3   -1.691841   -1.632132   1.562630    0.997304
4   -2.084926   -1.632132   1.562630    0.997304

I'm trying to reorganize the data into a new structure where each row contains 'period' number of rows from the existing data. Moves ahead one row in existing and the stacks the next 'period lines'.

the function so far:

def prepData(seq, period):
    newStacks = pd.DataFrame()
    for pos in range(0, len(seq) - (period+1), 1):
        chunk = (seq[pos:pos + period])
        stack = []
        for sliver in range(0, len(chunk), 1):
            piece = (chunk.iloc[sliver:])
            print(piece)
            stack.append(piece) 
        
        newStacks.append(chunk)

return newStacks

this obviously is not efficient and aside doesn't produce the desire structure. The aim is to get a structure like below considering period = 3

0   -0.483128   -1.240024   -1.214765   -1.002418   -0.692334   -1.632132   1.562630    0.997304 -1.189383  -1.632132   1.562630    0.997304
1   -0.692334   -1.632132   1.562630    0.997304    -1.189383   -1.632132   1.562630    0.997304 -1.691841  -1.632132   1.562630    0.997304
2   -1.189383   -1.632132   1.562630    0.997304    -1.691841   -1.632132   1.562630    0.997304 -2.084926  -1.632132   1.562630    0.997304

AA Simple way to accomplish this would be appreciated.

Upvotes: 0

Views: 75

Answers (2)

jezrael
jezrael

Reputation: 862691

If there are all floats columns for improve performance use strides in numpy with reshape 3d array to 2d and pass to DataFrame constructor:

#https://stackoverflow.com/a/44306231/2901002 a bit changed
def strided_lastaxis(a, L):
    s0,s1 = a.strides
    m,n = a.shape
    return np.lib.stride_tricks.as_strided(a, shape=(m-L+1,L,n), strides=(s0,s0,s1))

a = strided_lastaxis(df.to_numpy(), 3)

df1 = pd.DataFrame(a.reshape(a.shape[0], -1))
print (df1)
         0         1         2         3         4         5        6   \
0 -0.483128 -1.240024 -1.214765 -1.002418 -0.692334 -1.632132  1.56263   
1 -0.692334 -1.632132  1.562630  0.997304 -1.189383 -1.632132  1.56263   
2 -1.189383 -1.632132  1.562630  0.997304 -1.691841 -1.632132  1.56263   

         7         8         9        10        11  
0  0.997304 -1.189383 -1.632132  1.56263  0.997304  
1  0.997304 -1.691841 -1.632132  1.56263  0.997304  
2  0.997304 -2.084926 -1.632132  1.56263  0.997304  

Upvotes: 2

Joe Ferndz
Joe Ferndz

Reputation: 8508

I am not sure if you want to create a dataframe or a list. Here's the code to get you both.

import pandas as pd
c = ['D','K','Factor','p']
d = [[-0.483128,   -1.240024,  -1.214765,   -1.002418],
[-0.692334,   -1.632132,   1.562630,    0.997304],
[-1.189383,   -1.632132,   1.562630,    0.997304],
[-1.691841,   -1.632132,   1.562630,    0.997304],
[-2.084926,   -1.632132,   1.562630,    0.997304]]
df = pd.DataFrame(d,columns=c)
print (df)

p = 3 #this is the period you wanted. I set it to 3
stack_list = []   #this will store the final stacked list
                  #note: don't use stack, its used by pandas to stack

for i in range(len(df)-p+1):  #iterating thru the dataframe
    # convert p rows to a list after you stack them
    chunk = df.loc[i:i+p-1].stack().reset_index(level=1,drop=True).tolist()

    stack_list.append(chunk)   #store chunk to stack_list


df1 = pd.DataFrame(stack_list)   #creating a dataframe as per your request

#printing both stack_list and dataframe

print (stack_list)
print (df1)

The output of this will be:

The original dataframe is:

          D         K    Factor         p
0 -0.483128 -1.240024 -1.214765 -1.002418
1 -0.692334 -1.632132  1.562630  0.997304
2 -1.189383 -1.632132  1.562630  0.997304
3 -1.691841 -1.632132  1.562630  0.997304
4 -2.084926 -1.632132  1.562630  0.997304

The stacked list is:

[[-0.483128, -1.240024, -1.214765, -1.002418, -0.692334, -1.632132, 1.56263, 0.997304, -1.189383, -1.632132, 1.56263, 0.997304], 
 [-0.692334, -1.632132, 1.56263, 0.997304, -1.189383, -1.632132, 1.56263, 0.997304, -1.691841, -1.632132, 1.56263, 0.997304], 
 [-1.189383, -1.632132, 1.56263, 0.997304, -1.691841, -1.632132, 1.56263, 0.997304, -2.084926, -1.632132, 1.56263, 0.997304]]

The new dataframe that you wanted to get created is:

         0         1         2   ...        9        10        11
0 -0.483128 -1.240024 -1.214765  ... -1.632132  1.56263  0.997304
1 -0.692334 -1.632132  1.562630  ... -1.632132  1.56263  0.997304
2 -1.189383 -1.632132  1.562630  ... -1.632132  1.56263  0.997304

Upvotes: 2

Related Questions