HM14

Reputation: 709

How can I make my Python code run faster?

I am working on code that loops over multiple netCDF files (large, ~28 GB each). The netCDF files contain multiple 4D variables [time, east-west, south-north, height] over a domain. The goal is to loop over these files, loop over each location in the domain, and pull certain variables into a large array. When files are missing or incomplete, I fill the values with 99.99. Right now I am just testing with 2 daily netCDF files, but for some reason it is taking forever (~14 hours). I am not sure whether the problem is Python or my code, but I don't think this task should take so long. Below is my code; I hope it is readable, and any suggestions on how to make it faster are greatly appreciated:

import os
import numpy as np
import netCDF4 as nc

#Domain to loop over
k_space = np.arange(0,37)
j_space = np.arange(80,170)
i_space = np.arange(200,307)

predictors_wrf=[]
names_wrf=[]

counter = 0
cdate = start_date
while cdate <= end_date:
    if cdate.month not in month_keep:
        cdate+=inc
        continue
    yy = cdate.strftime('%Y')        
    mm = cdate.strftime('%m')
    dd = cdate.strftime('%d')
    filename = wrf_path+'\wrfoutRED_d01_'+yy+'-'+mm+'-'+dd+'_'+hour_str+'_00_00'
    for i in i_space:
        for j in j_space:
            for k in k_space:
                if os.path.isfile(filename):
                    f = nc.Dataset(filename,'r')
                    times = f.variables['Times'][1:]
                    num_lines = times.shape[0]
                    if num_lines == 144:
                        u = f.variables['U'][1:,k,j,i]
                        v = f.variables['V'][1:,k,j,i]
                        wspd = np.sqrt(u**2.+v**2.)
                        w = f.variables['W'][1:,k,j,i]
                        p = f.variables['P'][1:,k,j,i]
                        t = f.variables['T'][1:,k,j,i]
                    elif num_lines < 144:
                        print("partial files for WRF: " + filename)
                        u = np.ones((144,))*99.99
                        v = np.ones((144,))*99.99
                        wspd = np.ones((144,))*99.99
                        w = np.ones((144,))*99.99
                        p = np.ones((144,))*99.99
                        t = np.ones((144,))*99.99
                else:
                    u = np.ones((144,))*99.99
                    v = np.ones((144,))*99.99
                    wspd = np.ones((144,))*99.99
                    w = np.ones((144,))*99.99
                    p = np.ones((144,))*99.99
                    t = np.ones((144,))*99.99
                    counter += 1
                predictors_wrf.append(u)
                predictors_wrf.append(v)
                predictors_wrf.append(wspd)
                predictors_wrf.append(w)
                predictors_wrf.append(p)
                predictors_wrf.append(t)
                u_names = 'u_'+str(k)+'_'+str(j)+'_'+str(i)
                v_names = 'v_'+str(k)+'_'+str(j)+'_'+str(i)
                wspd_names = 'wspd_'+str(k)+'_'+str(j)+'_'+str(i)
                w_names = 'w_'+str(k)+'_'+str(j)+'_'+str(i)
                p_names = 'p_'+str(k)+'_'+str(j)+'_'+str(i)
                t_names = 't_'+str(k)+'_'+str(j)+'_'+str(i)
                names_wrf.append(u_names)
                names_wrf.append(v_names)
                names_wrf.append(wspd_names)
                names_wrf.append(w_names)
                names_wrf.append(p_names)
                names_wrf.append(t_names)
    cdate+=inc

Upvotes: 2

Views: 6882

Answers (3)

Wenlong Liu

Reputation: 444

For your question, I think multiprocessing will help a lot. I went through your code and have some advice here.

  1. Use the filenames, not the start time, as the iterators in your code.

    Wrap the date handling in a function that builds all the filenames and returns them as a list.

    def fileNames(start_date, end_date):
        # Find all filenames.
        cdate = start_date
        fileNameList = [] 
        while cdate <= end_date:
            if cdate.month not in month_keep:
                cdate+=inc
                continue
            yy = cdate.strftime('%Y')        
            mm = cdate.strftime('%m')
            dd = cdate.strftime('%d')
            filename = wrf_path+'\wrfoutRED_d01_'+yy+'-'+mm+'-'+dd+'_'+hour_str+'_00_00'
            fileNameList.append(filename)
            cdate+=inc
    
        return fileNameList
    
  2. Wrap the code that pulls the data (and fills missing values with 99.99) in a function whose input is the filename.

    def dataExtraction(filename):
        # Local lists, so each worker process returns its own results.
        predictors_wrf = []
        names_wrf = []
        file_exists = os.path.isfile(filename)
        if file_exists:
            f = nc.Dataset(filename, 'r')
            times = f.variables['Times'][1:]
            num_lines = times.shape[0]
        for i in i_space:
            for j in j_space:
                for k in k_space:
                    if file_exists and num_lines == 144:
                        u = f.variables['U'][1:,k,j,i]
                        v = f.variables['V'][1:,k,j,i]
                        wspd = np.sqrt(u**2.+v**2.)
                        w = f.variables['W'][1:,k,j,i]
                        p = f.variables['P'][1:,k,j,i]
                        t = f.variables['T'][1:,k,j,i]
                    else:
                        # Missing or partial file: fill every series with 99.99.
                        if file_exists:
                            print("partial files for WRF: " + filename)
                        u = np.ones((144,))*99.99
                        v = np.ones((144,))*99.99
                        wspd = np.ones((144,))*99.99
                        w = np.ones((144,))*99.99
                        p = np.ones((144,))*99.99
                        t = np.ones((144,))*99.99
                    predictors_wrf.append(u)
                    predictors_wrf.append(v)
                    predictors_wrf.append(wspd)
                    predictors_wrf.append(w)
                    predictors_wrf.append(p)
                    predictors_wrf.append(t)
                    names_wrf.append('u_'+str(k)+'_'+str(j)+'_'+str(i))
                    names_wrf.append('v_'+str(k)+'_'+str(j)+'_'+str(i))
                    names_wrf.append('wspd_'+str(k)+'_'+str(j)+'_'+str(i))
                    names_wrf.append('w_'+str(k)+'_'+str(j)+'_'+str(i))
                    names_wrf.append('p_'+str(k)+'_'+str(j)+'_'+str(i))
                    names_wrf.append('t_'+str(k)+'_'+str(j)+'_'+str(i))

        # list() so the result can be pickled back from the worker process.
        return list(zip(predictors_wrf, names_wrf))
    
  3. Use multiprocessing to do the work. Almost all computers have more than one CPU core, and multiprocessing speeds things up when there is heavy CPU-bound computation. In my experience it has cut the time spent on huge datasets by up to two thirds.

    Update: after testing my code and files again on Feb. 25, 2017, I found that using 8 cores for a huge dataset saved me about 90% of the elapsed time.

    if __name__ == '__main__':
        from multiprocessing import Pool  # normally placed with the other imports
        import datetime
        # Use date objects, not strings, because fileNames() calls strftime.
        start_date = datetime.date(2017, 1, 1)
        end_date = datetime.date(2017, 1, 15)
        file_list = fileNames(start_date, end_date)  # avoid shadowing the function name
        p = Pool(4)  # the number of cores you want to use
        results = p.map(dataExtraction, file_list)
        p.close()
        p.join()
    
  4. Finally, be careful with the data structures here, as they get fairly complicated: p.map returns one list of (series, name) pairs per file, which you will want to flatten afterwards, as in the sketch below. Hope this helps. Please leave comments if you have any further questions.
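
    A minimal sketch of that flattening step, reusing the names above:

    # results holds one list of (series, name) pairs per input file.
    predictors_wrf = []
    names_wrf = []
    for pairs in results:
        for series, name in pairs:
            predictors_wrf.append(series)
            names_wrf.append(name)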

Upvotes: 2

Selecsosi

Reputation: 1646

This is a lame first pass to tighten up your for-loops. Since you only need the file's shape once per file, you can move that handling outside the spatial loops, which should cut out the repeated opening and loading of data that was interrupting processing. I still don't get what counter is for, since it is incremented but never read. You also want to look at the cost of the repeated string concatenation, and at how all of those appends to predictors_wrf and names_wrf perform, as starting points; one way to tighten that is sketched after the code.

k_space = np.arange(0,37)
j_space = np.arange(80,170)
i_space = np.arange(200,307)

predictors_wrf=[]
names_wrf=[]

counter = 0
cdate = start_date
while cdate <= end_date:
    if cdate.month not in month_keep:
        cdate+=inc
        continue
    yy = cdate.strftime('%Y')        
    mm = cdate.strftime('%m')
    dd = cdate.strftime('%d')
    filename = wrf_path+'\wrfoutRED_d01_'+yy+'-'+mm+'-'+dd+'_'+hour_str+'_00_00'
    file_exists = os.path.isfile(filename)
    if file_exists:
        f = nc.Dataset(filename,'r')
        times = f.variables['Times'][1:]
        num_lines = times.shape[0]
    for i in i_space:
        for j in j_space:
            for k in k_space:
                if file_exists:
                    if num_lines == 144:
                        u = f.variables['U'][1:,k,j,i]
                        v = f.variables['V'][1:,k,j,i]
                        wspd = np.sqrt(u**2.+v**2.)
                        w = f.variables['W'][1:,k,j,i]
                        p = f.variables['P'][1:,k,j,i]
                        t = f.variables['T'][1:,k,j,i]
                    elif num_lines < 144:
                        print("partial files for WRF: " + filename)
                        u = np.ones((144,))*99.99
                        v = np.ones((144,))*99.99
                        wspd = np.ones((144,))*99.99
                        w = np.ones((144,))*99.99
                        p = np.ones((144,))*99.99
                        t = np.ones((144,))*99.99
                else:
                    u = np.ones((144,))*99.99
                    v = np.ones((144,))*99.99
                    wspd = np.ones((144,))*99.99
                    w = np.ones((144,))*99.99
                    p = np.ones((144,))*99.99
                    t = np.ones((144,))*99.99
                    counter += 1
                predictors_wrf.append(u)
                predictors_wrf.append(v)
                predictors_wrf.append(wspd)
                predictors_wrf.append(w)
                predictors_wrf.append(p)
                predictors_wrf.append(t)
                u_names = 'u_'+str(k)+'_'+str(j)+'_'+str(i)
                v_names = 'v_'+str(k)+'_'+str(j)+'_'+str(i)
                wspd_names = 'wspd_'+str(k)+'_'+str(j)+'_'+str(i)
                w_names = 'w_'+str(k)+'_'+str(j)+'_'+str(i)
                p_names = 'p_'+str(k)+'_'+str(j)+'_'+str(i)
                t_names = 't_'+str(k)+'_'+str(j)+'_'+str(i)
                names_wrf.append(u_names)
                names_wrf.append(v_names)
                names_wrf.append(wspd_names)
                names_wrf.append(w_names)
                names_wrf.append(p_names)
                names_wrf.append(t_names)
    cdate+=inc
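
On the string-concatenation and append points, one possible tightening (a sketch, placed inside the k loop once u through t are set): drive all six appends from one loop and build each name with a single format call.

for var_name, series in (('u', u), ('v', v), ('wspd', wspd),
                         ('w', w), ('p', p), ('t', t)):
    predictors_wrf.append(series)
    names_wrf.append('{0}_{1}_{2}_{3}'.format(var_name, k, j, i))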

Upvotes: 2

erewok

Reputation: 7835

I don't have very many suggestions, but a couple of things to note.

Don't open that file so many times

First, you define this filename variable and then inside this loop (deep inside: three for-loops deep), you are checking if the file exists and presumably opening it there (I don't know what nc.Dataset does, but I'm guessing it must open the file and read it):

filename = wrf_path+'\wrfoutRED_d01_'+yy+'-'+mm+'-'+dd+'_'+hour_str+'_00_00'
for i in i_space:
    for j in j_space:
        for k in k_space:
            if os.path.isfile(filename):
                f = nc.Dataset(filename,'r')

This is going to be pretty inefficient. You can certainly open it once, before all of your loops, if the file doesn't change.
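
A sketch of that, reusing the question's variables (and assuming the file does not change while the loops run):

import os
import netCDF4 as nc  # assuming nc is the netCDF4 module, as the question implies

file_exists = os.path.isfile(filename)  # filename is built once per date, as above
if file_exists:
    f = nc.Dataset(filename, 'r')  # one open per date instead of one per grid point
    num_lines = f.variables['Times'][1:].shape[0]
for i in i_space:
    for j in j_space:
        for k in k_space:
            if file_exists and num_lines == 144:
                u = f.variables['U'][1:,k,j,i]  # slice as before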

Try to Use Fewer for-loops

All of these nested for-loops are compounding the number of operations you need to perform. General suggestion: try to use numpy operations instead.
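
For example, netCDF4 variables accept multidimensional slices, so the whole subdomain can be pulled in one read per variable and the wind speed computed for every grid point at once. A sketch reusing the question's filename and domain bounds (37 x 90 x 107, about 356,000 points):

import numpy as np
import netCDF4 as nc

f = nc.Dataset(filename, 'r')
# One read per variable for the entire subdomain: (time, height, south-north, east-west)
u = f.variables['U'][1:, 0:37, 80:170, 200:307]
v = f.variables['V'][1:, 0:37, 80:170, 200:307]
wspd = np.sqrt(u**2 + v**2)  # vectorized over all grid points at once
# Reshape so each column is the time series of one (k, j, i) location.
u_columns = u.reshape(u.shape[0], -1)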

Use CProfile

If you want to know why your programs are taking a long time, one of the best ways to find out is to profile them.
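
For instance, the standard-library cProfile module can be pointed at the whole script from the shell (the script name here is a placeholder):

# From the shell, sorted by cumulative time:
#   python -m cProfile -s cumulative your_extraction_script.py
# Or from inside the code, assuming the work is wrapped in a main() function:
import cProfile
cProfile.run('main()', sort='cumulative')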

Upvotes: 1
