Mahsolid

Reputation: 433

Array ranges for columns in python pandas

I have a three-column dataset formatted as follows:

    t_stamp,Xval,Ytval
    0.000543,0,10
    0.000575,0,10
    0.041324,1,10
    0.041331,2,10
    0.041336,3,10
    0.04134,4,10
    0.041345,5,10
    0.04135,6,10
    0.041354,7,10


import pandas as pd
df = pd.read_csv('test.csv')

If I want only the first 3 shifts (lags) of the Xval column, I can do the following:

for i in range(1, 4):
    # the column in the CSV is named 'Xval', not 'X'
    df['X_t' + str(i)] = df['Xval'].shift(i)
X = df[['Xval', 'X_t1', 'X_t2', 'X_t3']].values

and if I want 5 shifts, I can do the following:

for i in range(1, 6):
    # the column in the CSV is named 'Xval', not 'X'
    df['X_t' + str(i)] = df['Xval'].shift(i)
X = df[['Xval', 'X_t1', 'X_t2', 'X_t3', 'X_t4', 'X_t5']].values

However, this does not scale to a larger number of shifts. How can I automate this so that I don't have to write out the column list by hand (X_t1, X_t2, ..., X_t1000) when assigning X if I want, say, 1000 shifts?

Upvotes: 1

Views: 468

Answers (1)

cs95

Reputation: 402854

Use a dict comprehension to build a dict of shifted columns, wrap it in a temporary DataFrame, and extract its values:

In [1679]: pd.DataFrame({ 'X_%d'%i : df['Xval'].shift(i) for i in range(5)}).values
Out[1679]: 
array([[  0.,  nan,  nan,  nan,  nan],
       [  0.,   0.,  nan,  nan,  nan],
       [  1.,   0.,   0.,  nan,  nan],
       [  2.,   1.,   0.,   0.,  nan],
       [  3.,   2.,   1.,   0.,   0.],
       [  4.,   3.,   2.,   1.,   0.],
       [  5.,   4.,   3.,   2.,   1.],
       [  6.,   5.,   4.,   3.,   2.],
       [  7.,   6.,   5.,   4.,   3.]])

Similarly,

In [1680]: pd.DataFrame({ 'X_%d'%i : df['Xval'].shift(i) for i in range(10)}).values
Out[1680]: 
array([[  0.,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan],
       [  0.,   0.,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan],
       [  1.,   0.,   0.,  nan,  nan,  nan,  nan,  nan,  nan,  nan],
       [  2.,   1.,   0.,   0.,  nan,  nan,  nan,  nan,  nan,  nan],
       [  3.,   2.,   1.,   0.,   0.,  nan,  nan,  nan,  nan,  nan],
       [  4.,   3.,   2.,   1.,   0.,   0.,  nan,  nan,  nan,  nan],
       [  5.,   4.,   3.,   2.,   1.,   0.,   0.,  nan,  nan,  nan],
       [  6.,   5.,   4.,   3.,   2.,   1.,   0.,   0.,  nan,  nan],
       [  7.,   6.,   5.,   4.,   3.,   2.,   1.,   0.,   0.,  nan]])

range starts from 0 because df['Xval'].shift(0) is just the unshifted Xval column, which you want as well.
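
For a large number of shifts, the same idea can be wrapped in a small helper. This is only a sketch: lagged_matrix is a hypothetical name, the sample values are typed in from the question's Xval column, and pd.concat with keys= is swapped in for the dict so that the column order follows the list order explicitly.

```python
import pandas as pd

# Sample values copied from the question's Xval column.
df = pd.DataFrame({"Xval": [0, 0, 1, 2, 3, 4, 5, 6, 7]})

def lagged_matrix(series, n_shifts):
    """Return a 2-D array whose column i is `series` shifted down by i rows."""
    shifted = [series.shift(i) for i in range(n_shifts)]
    # keys= labels the columns X_0, X_1, ... in the same order as the list.
    labels = ["X_%d" % i for i in range(n_shifts)]
    frame = pd.concat(shifted, axis=1, keys=labels)
    return frame.values

X = lagged_matrix(df["Xval"], 5)
print(X.shape)  # (9, 5)
```

Calling lagged_matrix(df["Xval"], 1000) works the same way; with only 9 rows, every column past the ninth is entirely nan.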

If you want to replace those nan values, you can call fillna(0) on the temporary DataFrame first, and then extract its values:

pd.DataFrame({ 'X_%d'%i : df['Xval'].shift(i) for i in range(5)}).fillna(0).values

Alternatively, apply np.nan_to_num:

pd.DataFrame({ 'X_%d'%i : df['Xval'].shift(i) for i in range(5)}).apply(np.nan_to_num, axis=0).values
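
Another option, assuming pandas 0.24 or later, is to pass fill_value to shift so the gaps are never filled with nan in the first place. A minimal sketch, with the question's Xval values typed in by hand:

```python
import pandas as pd

# Sample values copied from the question's Xval column.
df = pd.DataFrame({"Xval": [0, 0, 1, 2, 3, 4, 5, 6, 7]})

# fill_value=0 pads the positions vacated by the shift with 0 instead of nan.
X = pd.DataFrame(
    {"X_%d" % i: df["Xval"].shift(i, fill_value=0) for i in range(5)}
).values
print(X[:3])
```

Because no nan is introduced, the columns should not need to be upcast to float, although that is worth checking on your pandas version.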

Alternatively, use dropna, if you are okay with your dataframe being truncated:

In [1763]: pd.DataFrame({ 'X_%d'%i : df['Xval'].shift(i) for i in range(5)}).dropna().values
Out[1763]: 
array([[ 3.,  2.,  1.,  0.,  0.],
       [ 4.,  3.,  2.,  1.,  0.],
       [ 5.,  4.,  3.,  2.,  1.],
       [ 6.,  5.,  4.,  3.,  2.],
       [ 7.,  6.,  5.,  4.,  3.]])

Compare this to the first example: all rows containing nan have been dropped.
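
If only the fully populated rows are needed, a similar matrix can be built without any shifting. The sketch below is an alternative technique, not part of the approach above: it assumes NumPy 1.20 or later (for sliding_window_view) and hard-codes the question's Xval values.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Sample values copied from the question's Xval column.
x = np.array([0, 0, 1, 2, 3, 4, 5, 6, 7])

n_shifts = 5
# Each window holds n_shifts consecutive values. Reversing every window puts
# the current value first and the oldest lag last, like shift(0) ... shift(4).
X = sliding_window_view(x, n_shifts)[:, ::-1]
print(X)
```

The result is a read-only view onto x, so call X.copy() if you need to modify it.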

Upvotes: 5
