Reputation: 2142
I'm processing rather large amounts of data. I ran several tests at fixed numbers of records (1 million, 10 million, and 100 million) and measured execution time with time(1). I ended up with the following CSV of results (the columns are: number of records, extra processing, elapsed time, user time, sys time):
1000000,false,4.29,13.62,0.48
1000000,true,8.78,28.28,0.89
10000000,false,69.17,229.20,8.26
10000000,true,106.89,343.34,11.78
100000000,false,1053.46,3058.38,126.66
100000000,true,1255.68,4011.54,143.87
1000000,false,8.40,27.86,1.01
1000000,true,12.59,40.75,1.44
10000000,false,92.84,309.81,10.85
10000000,true,125.52,410.81,14.06
100000000,false,963.49,2935.52,116.03
100000000,true,1435.18,4238.75,154.30
1000000,false,9.12,29.94,1.14
1000000,true,12.90,42.21,1.48
10000000,false,96.32,321.50,11.65
10000000,true,122.68,400.36,13.92
100000000,false,872.66,2876.10,109.40
100000000,true,1170.53,3771.05,131.80
1000000,false,11.07,36.70,1.28
1000000,true,13.21,43.15,1.44
10000000,false,94.08,312.17,11.42
10000000,true,126.83,411.92,14.10
100000000,false,870.20,2861.60,109.60
100000000,true,1138.72,3692.30,127.56
1000000,false,8.60,28.48,1.04
1000000,true,13.14,42.88,1.48
10000000,false,87.76,290.91,10.50
10000000,true,118.03,382.60,12.80
100000000,false,858.91,2822.96,106.71
100000000,true,1190.48,3857.58,133.79
1000000,false,8.91,29.59,1.00
1000000,true,12.91,42.01,1.55
10000000,false,89.62,296.94,11.00
10000000,true,116.50,378.21,12.77
100000000,false,870.43,2858.22,109.46
100000000,true,1126.05,3641.41,127.34
1000000,false,9.46,31.40,1.20
1000000,true,11.12,36.28,1.17
10000000,false,87.26,289.12,10.78
10000000,true,115.46,372.48,12.70
100000000,false,1044.48,3029.55,121.52
100000000,true,1393.75,4083.24,147.38
1000000,false,9.75,30.62,1.24
1000000,true,14.79,45.33,1.52
10000000,false,99.32,317.52,12.20
10000000,true,150.65,428.98,16.02
100000000,false,916.92,2979.20,115.72
100000000,true,1119.58,3619.34,126.22
1000000,false,8.85,29.42,1.04
1000000,true,12.47,40.42,1.40
10000000,false,94.12,312.18,11.27
10000000,true,121.16,393.87,13.56
100000000,false,884.21,2898.08,110.16
100000000,true,1131.85,3655.16,128.92
1000000,false,8.86,29.51,1.08
1000000,true,12.32,40.12,1.21
10000000,false,89.75,298.62,10.80
10000000,true,114.46,371.82,12.69
100000000,false,868.67,2842.56,109.55
100000000,true,1139.24,3680.05,127.93
How can I predict the time to process, for example, a billion records? I'm going to use R, so that I can also visualize the data.
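For context, loading the results into R could look something like this (the file name results.csv and the column names are just placeholders):

    # Read the timing results; read.csv turns the true/false column into a logical.
    timings <- read.csv("results.csv", header = FALSE,
                        col.names = c("records", "extra", "elapsed", "user", "sys"))
    str(timings)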
Upvotes: 1
Views: 144
Reputation: 73325
You can't make a reliable prediction from your current data. Although you have many observations, they are collected at only 3 unique problem sizes: 1 million, 10 million, and 100 million.
Your data, when plotted, look like this:
[plot: elapsed time vs. number of records, with and without extra processing]
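A base-R sketch that reproduces such a plot, assuming the timings data frame loaded in the question:

    # Plot elapsed time against problem size, colouring points by extra processing.
    plot(elapsed ~ records, data = timings,
         col = ifelse(timings$extra, "red", "blue"),
         xlab = "number of records", ylab = "elapsed time (s)")
    legend("topleft", legend = c("extra processing", "no extra processing"),
           col = c("red", "blue"), pch = 1)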
To make a prediction, we need a regression model, but with such data it is not possible to fit one reliably. You need to collect data at more problem sizes, like 1, 2, 3, 4, 5, ..., 99, 100 million. For each size, collect data both with and without extra processing. Only then can we estimate how the processing time grows with problem size: for example, is the growth linear or quadratic?
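Once data at more sizes are collected, fitting and comparing candidate models could look like this (a sketch, reusing the timings data frame from the question; extrapolating two orders of magnitude beyond the data remains risky):

    # Fit linear and quadratic models for the runs without extra processing.
    no_extra <- subset(timings, !extra)
    fit_lin  <- lm(elapsed ~ records, data = no_extra)
    fit_quad <- lm(elapsed ~ poly(records, 2), data = no_extra)
    summary(fit_lin)   # inspect fit quality (R-squared, residuals)
    # Extrapolate (with caution) to one billion records:
    predict(fit_lin, newdata = data.frame(records = 1e9))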
Upvotes: 1