Yati Raj

Reputation: 448

Why does this loop scale with quadratic time?

I am rearranging a matrix in numpy using the code below

import time
import numpy as np

t1 = time.time()

df1 = train1[1, 1:52]
for i in xrange(40):
    for j in xrange(52, 551):
        x = train1[i, (j - 51):j]
        df1 = np.vstack((df1, x))

t2 = time.time()
t = t2 - t1

When I run the outer loop for 5 turns [ i in xrange(5), j unchanged ], it takes <1 sec; for 10 turns, ~4 secs; for 20 turns, ~18 secs; and for 40 turns, ~85 secs.

Can someone please clarify why the loop's runtime scales quadratically even though the outer loop range only increases linearly?

Thanks

PS: The matrix I am using here is the training set from the Kaggle Wikipedia competition. You can download train_1.csv from link, which I read into a pandas DataFrame and then converted to a numpy matrix (i.e. train1) using .as_matrix()

Upvotes: 1

Views: 186

Answers (2)

Arthur Spoon

Reputation: 462

I just ran your code with more time checks and a randomly generated matrix for train1:

import time
import numpy as np

t_total=time.time()
train1=np.random.random((20, 550))
df1=train1[1,1:52]
for i in range(5):
    t1 = time.time()
    tj = []
    for j in range(52,551):
        t2 = time.time()
        x=train1[i,(j-51):j]
        df1=np.concatenate((df1,x),axis=0)
        tj.append(time.time()-t2) 
    print("Time to loop on j:", time.time()-t1)
    print("Average time for each j:", np.mean(tj))

print("Total time:", time.time()-t_total)

When I run this I get the following output, which clearly shows each pass of the outer loop taking longer than the last.

Time to loop on j: 0.009780406951904297
Average time for each j: 1.9157577851e-05
Time to loop on j: 0.02693343162536621
Average time for each j: 5.33469932113e-05
Time to loop on j: 0.06705927848815918
Average time for each j: 0.000133752822876
Time to loop on j: 0.08919048309326172
Average time for each j: 0.000178138813179
Time to loop on j: 0.11366486549377441
Average time for each j: 0.000227188060661
Total time: 0.3072977066040039

My guess is: np.vstack simply takes more time as the input matrices grow, and that is what's causing your execution time to grow quadratically. I can't find a numpy equivalent that avoids this issue, however. A solution that works is to store all the arrays you want to stack in a list, then do the stacking once at the end:

import time
import numpy as np

t_total=time.time()
train1=np.random.random((20, 550))
df1=[train1[1,1:52]]
for i in range(5):
    t1 = time.time()
    tj = []
    for j in range(52,551):
        t2 = time.time()
        x=train1[i,(j-51):j]
        df1.append(x)
        tj.append(time.time()-t2) 
    print("Time to loop on j:", time.time()-t1)
    print("Average time for each j:", np.mean(tj))
df1 = np.vstack(df1)
print("Total time:", time.time()-t_total)

And the running times this gets me:

Time to loop on j: 0.0005383491516113281
Average time for each j: 7.99347260194e-07
Time to loop on j: 0.0005192756652832031
Average time for each j: 7.58734876981e-07
Time to loop on j: 0.0005254745483398438
Average time for each j: 7.73546452035e-07
Time to loop on j: 0.0005245208740234375
Average time for each j: 7.73546452035e-07
Time to loop on j: 0.0005295276641845703
Average time for each j: 7.80235550447e-07
Total time: 0.008821249008178711

Stacking once at the end avoids repeatedly copying an ever-growing array, and appending a fixed-size object to a Python list has a fixed (amortized constant) cost regardless of the list's size.

Upvotes: 1

hilberts_drinking_problem

Reputation: 11602

The problem is that the call to vstack creates a copy of df1 in each iteration. Since the size of df1 grows linearly with the number of iterations, the total amount of copying is quadratic in the outer loop range, which gives you your quadratic runtime.
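A quick back-of-the-envelope sketch (my own illustration, using the loop bounds from the question) shows this: each vstack rewrites every row accumulated so far, so doubling the outer range roughly quadruples the total number of elements copied.

```python
# Rough model of the copy cost in the question's loop (illustration only).
rows_per_outer = 551 - 52          # 499 inner iterations per outer turn

def elements_copied(n_outer):
    total, rows = 0, 1             # df1 starts as a single 51-element row
    for _ in range(n_outer * rows_per_outer):
        rows += 1
        total += rows * 51         # each vstack rebuilds the whole stack
    return total

print(elements_copied(10) / elements_copied(5))   # roughly 4
print(elements_copied(20) / elements_copied(10))  # roughly 4
```

That factor-of-4 growth per doubling matches the ~1 s / ~4 s / ~18 s / ~85 s timings reported in the question.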

Profiling the code shows that most of the time is spent in concatenate, which is called by vstack:

In [13]: cProfile.run('q.proc()')
         259486 function calls in 19.759 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.001    0.001   19.759   19.759 <string>:1(<module>)
    39920    0.030    0.000    0.036    0.000 numeric.py:534(asanyarray)
    19960    0.031    0.000   15.037    0.001 shape_base.py:182(vstack)
    19960    0.020    0.000    0.121    0.000 shape_base.py:237(<listcomp>)
    39920    0.057    0.000    0.101    0.000 shape_base.py:63(atleast_2d)
        1    4.720    4.720   19.758   19.758 temp.py:6(proc)
        1    0.000    0.000   19.759   19.759 {built-in method builtins.exec}
    39920    0.003    0.000    0.003    0.000 {built-in method builtins.len}
    39920    0.006    0.000    0.006    0.000 {built-in method numpy.core.multiarray.array}
    19960   14.886    0.001   14.886    0.001 {built-in method numpy.core.multiarray.concatenate}
        2    0.000    0.000    0.000    0.000 {built-in method time.time}
    39920    0.005    0.000    0.005    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

You could build a list of the x slices and concatenate once after the loop.

Edit: I defined train1 as np.random.rand(100,600).
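A minimal sketch of that fix applied to the question's loop (using the same random stand-in for train1 as above):

```python
import numpy as np

# Random data standing in for the Kaggle matrix.
train1 = np.random.rand(100, 600)

# Collect the 51-element windows in a list, then stack once at the end.
rows = [train1[1, 1:52]]
for i in range(40):
    for j in range(52, 551):
        rows.append(train1[i, (j - 51):j])
df1 = np.vstack(rows)

print(df1.shape)  # (1 + 40*499, 51) = (19961, 51)
```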

Upvotes: 2
