Ryan Saxe

Reputation: 17869

How to predict how long it will take for Python to run a script?

I want to make sure that I run my program when it is optimal; for example, if it will take 5 hours to complete, I should run it overnight!

I do know this program will end, and theoretically I should be able to base its running time on the size of the input. So here is the actual problem:

I need to open 16 pickled files containing pandas DataFrames that add up to a total of 1.5 GB. Note that I will also need to do this with DataFrames that add up to 20 GB, so what I need is a way of telling how long the following code will take given the total number of gigabytes:

import pickle
import os
def pickleSave(data, pickleFile):
    output = open(pickleFile, 'wb')
    pickle.dump(data, output)
    output.close()
    print "file has been saved to %s" % (pickleFile)
def pickleLoad(pickleFile):
    pkl_file = open(pickleFile, 'rb')
    data = pickle.load(pkl_file)
    pkl_file.close()
    return data
directory = '/Users/ryansaxe/Desktop/kaggle_parkinsons/GPS/'
files = os.listdir(directory)
dfs = [pickleLoad(directory + i) for i in files]
new_file = directory + 'new_file_dataframe'
pickleSave(dfs,new_file)

So now I need to write a function that will look like the following:

def time_fun(data_size_in_gigs):
    #some algorithm here
    print "your code will take ___ hours to run"

I have no clue how to approach this, or if it is even possible. Any ideas?

Upvotes: 3

Views: 4998

Answers (1)

i Code 4 Food

Reputation: 2154

The execution time is entirely dependent on your system: hard drive or SSD, processor, and so on. No one can tell you upfront how long it will take to run on YOUR computer. The only way to get a reasonable estimate is to run your script on sample files that add up to a small size, such as 100 MB, take note of how long it took, and base your estimate on that.

def time_fun(data_size_in_gigs):
    # time (in hours) you measured by hand for a 100 MB (0.1 GB) sample
    benchmark = time_you_manually_tested_for_100mb
    # scale linearly from the 0.1 GB benchmark to the requested size
    time_to_run = data_size_in_gigs / 0.1 * benchmark
    print "your code will take %s hours to run" % time_to_run

Edit: In fact, you may want to save each benchmark as a (size, time) pair in a file, to which you also automatically append a new entry whenever you actually run your script. In your function, you can then retrieve the 2 benchmarks that are closest to the data_size you are currently estimating and estimate off of them, interpolating in proportion to the data_size you need. Each adjacent pair of benchmarks defines a different linear slope, which will be the most accurate for data sizes near it (a code sketch of this idea follows below).

     |                  .
     |                 .
time |               .
     |            .
     |       .
     |_._________________
              size

Just avoid saving two benchmarks that differ by less than, say, 200 MB, as the actual time may vary and entries such as (999 MB, 100 minutes) followed by (1 GB, 95 minutes) could ruin your estimation.

The projection of the line defined by the last two points will be the closest estimate you have for new all-time-high data sizes.
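Here is a rough sketch of that idea. The file name benchmarks.txt, its one-pair-per-line format, and the helper names are all made up for the example; the assumption is that each line holds a size in GB and the measured time in hours, and that at least one benchmark has already been saved:

def load_benchmarks(path='benchmarks.txt'):
    # each line: "<size_in_gigs> <time_in_hours>"
    pairs = []
    with open(path) as f:
        for line in f:
            size, hours = line.split()
            pairs.append((float(size), float(hours)))
    return sorted(pairs)

def estimate_hours(data_size_in_gigs, benchmarks):
    if len(benchmarks) < 2:
        # only one benchmark: fall back to simple proportionality
        size, hours = benchmarks[0]
        return data_size_in_gigs / size * hours
    # pick the two benchmarks closest to the requested size; for an
    # all-time-high size these are simply the last two points, so the
    # line through them is extrapolated
    closest = sorted(benchmarks, key=lambda p: abs(p[0] - data_size_in_gigs))[:2]
    (s1, t1), (s2, t2) = sorted(closest)
    slope = (t2 - t1) / (s2 - s1)
    return t1 + slope * (data_size_in_gigs - s1)

With something like this in place, time_fun above could simply call estimate_hours(data_size_in_gigs, load_benchmarks()) and print the result.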

Upvotes: 3
