Reputation: 123
I'm using criterion to benchmark a function. The end goal is to create a box plot for different inputs to this function, but as far as I understand criterion, its output isn't enough for that.
Example bench output from $ cabal bench:
benchmarking...
measurement took 6.317 s
analysing with 1000 resamples
bootstrapping with 7 of 7 samples (100%)
time 174.5 ms (172.7 ms .. 177.0 ms)
1.000 R² (1.000 R² .. 1.000 R²)
mean 176.0 ms (175.0 ms .. 177.2 ms)
std dev 1.524 ms (993.5 μs .. 1.943 ms)
variance introduced by outliers: 12% (moderately inflated)
Looking at the docs, I thought I could get all measured times by using benchmarkWith along with a Config instead of just benchmark, but I'm missing something.
My current attempt:
import Criterion
import Criterion.Main
import Criterion.Types
import Data.Vector (Vector)
import qualified Data.Vector as V

config :: Config
config = defaultConfig
  { csvFile     = Just "benchmark.csv"
  , verbosity   = Verbose
  , rawDataFile = Just "rawbench.bin"
  , template    = "template.txt"
  }

main :: IO ()
main = do
  report <- benchmarkWith' config $ nf (fun x) z
  let a = reportMeasured report :: Vector Measured
  let b = fmap measTime a :: Vector Double
  writeFile "times.txt" $ show b
csvFile gives the values from the example output in .csv format, not all of the measured data. I don't yet know what the rawDataFile or the template look like, as they aren't created. The Vector obtained by mapping measTime over the result of reportMeasured includes times that are far off from any of the times in the ranges shown in the analysis, so I'm unsure what to do with it. They seem to be saved as the total elapsed time, and not the elapsed time for each iteration:
b = [0.1781445460001123,0.3514027309975063,0.5219033929970465,0.7031883550007478,0.8739448289998109,1.049261161002505,1.2441970670006413]
If we adjust for this via
let c = V.zipWith (-) b (V.cons 0 (V.init b))
we get
c = [0.17789985700073885,0.17817427000045427,0.17363899500196567,0.16757315399809158,0.17504766799902427,0.18031946599876392,0.17502718700052355]
which seems more plausible. Some values might be outside the ranges from the analysis, but at least they're no longer 3x the maximum.
Is it possible to acquire the full list of measured times, and is the above adjustment correct? It seems the number of saved times isn't the number of resamples; those may only be used for the analysis. Is it possible to adjust the number of samples?
Upvotes: 2
Views: 108
Reputation: 50864
Because timing a single evaluation of a fast-running function isn't very reliable, Criterion works by benchmarking your function in a loop. It starts by testing a loop with a single iteration, and then starts increasing the number of iterations -- by 1 for a while, and then by larger increments. The number of iterations used for a particular measurement is stored in the measIters field of the Measured value.
So, the increasing times you're seeing are not the result of Criterion reporting total accumulated time. Instead, they are the result of Criterion increasing the number of iterations on subsequent tests. Because your function is relatively slow, and the default configuration sets the total time limit for benchmarking a single function at 5 seconds, it looks like Criterion increases the number of iterations from 1 up to 7 before the total time spent benchmarking exceeds the time limit, so it never gets around to increasing the number of iterations by anything more than 1. Your calculation of differences "works", but only because you're looking at the difference between, say, running 5 iterations of your function and running 4 iterations of your function.
You can use the rescale function from Criterion.Types (a re-export from Criterion.Measurement.Types in the criterion-measurement package) to "adjust" a Measured value for the number of iterations. This basically divides all the statistics by the number of iterations, so your measTime will be the time per iteration.
Also, the "samples" and "resamples" can be confusing, but what's happened here is that your function has (probably) been evaluated 1+2+3+4+5+6+7=28 times total (averaging roughly 175ms per evaluation for a total of 4.9 seconds, just under the 5 second time limit -- the 6.317 seconds reported in your output probably includes some Criterion overhead). These 28 evaluations have only provided 7 samples (i.e., the 7 Measured values in your Report), corresponding with an initial 1-iteration sample through to the final 7-iteration sample. The resulting 7 mean evaluation times after rescale have then been resampled 1000 times in a statistical process called bootstrapping, in order to generate confidence intervals for the mean and standard deviation of the evaluation time. Note that these resamples don't represent any actual data collection, so they aren't useful to you in constructing a boxplot.
That's how Criterion works.
Unfortunately, you can't extract a useful boxplot of evaluation times from the data that results from this approach. You can certainly generate a boxplot of something, by upping the number of samples (i.e., increasing the timeLimit in your Config) and using rescale to generate times that relate to a single evaluation, but the distribution of the resulting collection of numbers is not the distribution of times for a single function evaluation (because the distribution of a mean calculated over multiple iterations is not the distribution of the time for a single iteration, cf. Sampling distribution), so the resulting boxplot doesn't represent any meaningful underlying distribution.
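For reference, that knob is the timeLimit field (in seconds), e.g.:

config :: Config
config = defaultConfig { timeLimit = 30 }   -- default is 5 seconds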
In your case, I think you are trying to produce a boxplot of your function's performance over a set of randomized inputs, right? If so, the "correct" but time-consuming approach is to use Criterion to run one benchmark per "random" input, ignore all the individual samples, and just use the best estimate of the function's runtime for that input, namely the indicated time below (ignoring all the other numbers in the table):
time 174.5 ms (172.7 ms .. 177.0 ms)
^^^^^^^^
1.000 R² (1.000 R² .. 1.000 R²)
mean 176.0 ms (175.0 ms .. 177.2 ms)
std dev 1.524 ms (993.5 μs .. 1.943 ms)
variance introduced by outliers: 12% (moderately inflated)
Once you have a collection of such times over, say, 30 random benchmarked inputs, you can produce a boxplot of those times, and you'll get a sensible boxplot of the distribution of execution times for your function over random inputs, where variation from sources other than the changing input (e.g., random garbage collection, competing processes in a multiprocessing system, cosmic rays, etc.) has been minimized through Criterion's extensive sampling and post-processing of results.
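A rough sketch of that per-input approach (again untested; slowFunction and the input list are placeholders for your own function and random inputs; I'm pulling the point estimate from anMean rather than digging the OLS "time" figure out of anRegress, which for a well-behaved benchmark is very close; and estPoint lives in Statistics.Types from the statistics package, which you may need to add as a dependency):

import Criterion (benchmarkWith', nf)
import Criterion.Main (defaultConfig)
import Criterion.Types (Report (..), SampleAnalysis (..))
import Statistics.Types (estPoint)

-- Stand-in for your own function; replace with `fun x` as appropriate.
slowFunction :: Int -> Integer
slowFunction n = sum [1 .. fromIntegral n]

-- One full Criterion run per input; keep only the point estimate.
timeForInput :: Int -> IO Double
timeForInput z = do
  report <- benchmarkWith' defaultConfig (nf slowFunction z)
  pure (estPoint (anMean (reportAnalysis report)))

main :: IO ()
main = do
  -- Illustrative inputs; substitute your own randomized inputs.
  let inputs = [1000000, 2000000 .. 30000000]
  times <- mapM timeForInput inputs
  -- These per-input estimates are what you'd feed to the boxplot.
  writeFile "boxplot-times.txt" (show times)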
Upvotes: 1