Reputation: 363
Suppose that I have a data set S that contains the service times for different jobs, S = {t1, t2, t3, ..., tn}, where ti is the service time for the ith job and n is the total number of jobs in my data set. S is only a sample from a population; here n is 300k. I would like to study the impact of long service times, as some jobs take very long and some do not. My intuition is to study this impact based on data gathered from a real system. The system under study has thousands of millions of jobs, and this number increases by 100 new jobs every few seconds. Also, service time is measured by benchmarking the jobs on a local machine, so in practice it is expensive to keep expanding the data set. Thus, I decided to randomly pick 300k jobs.
I am conducting simulation experiments where I have to generate a large number of jobs with their service times (say millions) and then do some other calculations.
How should I use S as a population in my simulation? I came across the following options:
1- use S itself. I could resample from it, either with replacement (bootstrapping) or without replacement.
2- fit a theoretical distribution model to S and then draw from it.
Am I correct? Which approach is best (pros and cons)? The first approach seems easy, as it just picks a random service time from S each time. Is it reliable? Any suggestion is appreciated, as I am not good at stats.
Upvotes: 0
Views: 244
Reputation: 19855
Quoting from this tutorial in the 2007 Winter Simulation Conference:
At first glance, trace-driven simulation seems appealing. That is where historical data are used directly as inputs. It’s hard to argue about the validity of the distributions when real data from the real-world system is used in your model. In practice, though, this tends to be a poor solution for several reasons. Historical data may be expensive or impossible to extract. It certainly won’t be available in unlimited quantities, which significantly curtails the statistical analysis possible. Storage requirements are high. And last, but not least, it is impossible to assess “what-if?” strategies or try to simulate a prospective system, i.e., one which doesn’t yet exist.
In summary, distribution fitting takes more work up front but is usually more useful in the long run.
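To make the two options concrete, here is a minimal sketch in Python using only the standard library. The lognormal family is purely an assumption chosen for illustration (the question does not say which distribution fits the service times), the function names are hypothetical, and `S` below is synthetic stand-in data, not the real 300k measurements.

```python
import math
import random
import statistics


def sample_empirical(service_times, k, rng):
    """Option 1: draw k service times from the data itself, with replacement."""
    return rng.choices(service_times, k=k)


def fit_lognormal(service_times):
    """Option 2, step 1: maximum-likelihood lognormal fit.

    Assumes all service times are strictly positive. Returns the mean and
    (population) standard deviation of the log service times.
    """
    logs = [math.log(t) for t in service_times]
    return statistics.fmean(logs), statistics.pstdev(logs)


def sample_lognormal(mu, sigma, k, rng):
    """Option 2, step 2: draw k service times from the fitted model."""
    return [rng.lognormvariate(mu, sigma) for _ in range(k)]


rng = random.Random(42)

# Synthetic stand-in for S (assumption: the real data would be loaded here).
S = [rng.lognormvariate(0.0, 1.0) for _ in range(10_000)]

empirical_draws = sample_empirical(S, 50_000, rng)

mu, sigma = fit_lognormal(S)
fitted_draws = sample_lognormal(mu, sigma, 50_000, rng)

# Resampling can never exceed the largest observed value; the fitted model can.
print("max observed:        ", max(S))
print("max empirical draw:  ", max(empirical_draws))
print("max fitted-model draw:", max(fitted_draws))
```

One point worth noting for a study of long service times: resampling from S can never produce a value larger than the largest one observed, whereas a fitted model can extrapolate into the tail, for better or worse depending on how well the chosen family matches the data. Whatever family is tried, it should be checked against S (for example with a Q-Q plot or a goodness-of-fit test) before it is trusted.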
Upvotes: 1