Predicting algorithm performance, O-notation

Question

I'm applying k-means based clustering on a set of text fields. The calculation completed performancewise as follows:

     1.000 records ~    4m:30s
    30.000 records ~   15m:30s
   100.000 records ~ 1h37m:30s

How can I estimate how much time would be necessary to complete the calculation of n records, for example 500.000. Could someone help me with an hands-on example, probably that would be done with the O-Notation, I don't understand how it works, tough.

Thank you very much.

user1196549 · Accepted Answer

The big-O notation is useful when you have some theoretical insight on the behavior of the algorithm. Like, for instance, you know that O(Log(N)) comparisons are made to find an item in a sorted list, if you are using a dichotomic search, vs O(N) for a linear search (this is on average, as sometimes the linear search can find immediately). In addition, big-O is a kind of upper bound that describes the growth rate but will not give you absolute figures in seconds, it is more qualitative than quantitative.

In your case, you are more on the empirical side and you should choose some numerical model with unknown parameters and find these by regression. The first to be tried is the power law: T(n)=a.n^b, or, expressed in bilogarithmic coordinates, a straight line Log(T(n)) = c. Log(n) + d. So it is advisable to observe your points on a bilogarithmic plot (Excel can help you).

Two more remarks:

three points is really little to observe the real trend, and any model will fit equally well;
more importantly, you must measure the dispersion on the running times for different cases for each size; because if the dispersion is large, your predictions based on a few values will just be erratic. This requirement is even stronger if you want to extrapolate rather than interpolate (I mean make guesses for larger values than you actually tried).

In your particular case, I observe that the three points do not align well and are not compatible wit a power model. You could consider a parabolic model, which would work fine, but this would be cheating: a parabolic model always perfectly fits to three points.

enter image description here

The conclusion is crystal clear: you need more points.

Predicting algorithm performance, O-notation

Answers (1)

Related Questions