Reputation: 89
I have 11 years (2007 to 2017) of daily temperature files, a total of 11*365 = 4015 NetCDF files. Each file has latitude (100,) and longitude (360,) dimensions and a temperature variable of size (360, 100). I want to find the 15-day running (moving) average at each grid point, ignoring NaN values if present. That means 15 files need to be used to find each mean. I have the following function to read all the daily files from a folder; e.g. the means of files_list[0:15], files_list[1:16], files_list[2:17], ..., files_list[4000:] need to be found, and each mean needs to be saved as a new NetCDF file. I have an idea of how to create a NetCDF file, but could not work out the running or moving average.
Here is my code:
import os
import glob
import natsort

def files_list(working_dir, extension):
    '''
    input  = working directory and extension of file (e.g. *.nc)
    output = naturally sorted list of the file names in the folder
    '''
    files = glob.glob(os.path.join(working_dir, extension))
    files = natsort.natsorted(files)
    # keep only the base name of each file in the directory
    return [os.path.basename(f) for f in files]
Upvotes: 2
Views: 2825
Reputation: 1800
A bit late to answer, but for anyone reading this in the future: xarray also provides an easy, Pythonic solution very similar to @Adrian Tomkins' answer, where one can first merge the files of each year and then merge the yearly files into a single file, to get around the limit on the number of files that can be open on a system.
import xarray as xr

# merge the daily files one year at a time to stay under the open-file limit
for yr in range(2007, 2018):
    file_name = str(yr) + 'merge.nc'
    xr.open_mfdataset(str(yr) + '*.nc', combine='nested', concat_dim='time').to_netcdf(file_name)
# then merge the yearly files into a single file
xr.open_mfdataset('*merge.nc', combine='nested', concat_dim='time').to_netcdf('merge_all.nc')
# option to chunk if the file size is too large; can also be used earlier with open_mfdataset
ds = xr.open_dataset('merge_all.nc', chunks={'lat': 10, 'lon': 10})
ds_rolling_mean = ds.rolling(time=15, center=True).mean()
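The result can be written back out with to_netcdf, e.g. ds_rolling_mean.to_netcdf('runmean_15day.nc') (the file name is just an example). If windows containing NaN values should still return the mean of the remaining days, min_periods (e.g. min_periods=1) can be passed to rolling().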
Edit: One big advantage of xarray over other classical tools is that you can easily do out-of-memory computations and scale them over multiple cores thanks to dask. For example, if you have to do some pre-processing of your files before merging, xr.open_mfdataset takes a user-defined pre-processing function through its preprocess argument, and setting parallel=True will pre-process the input files in parallel before merging.
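A minimal sketch of that pattern (the helper function and the variable name 'temp' are only illustrative, not from the question):
import xarray as xr

def keep_temp(ds):
    # hypothetical pre-processing step: keep only the temperature variable
    return ds[['temp']]

ds = xr.open_mfdataset('*.nc', combine='nested', concat_dim='time',
                       preprocess=keep_temp, parallel=True)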
Upvotes: 1
Reputation: 8087
This is not a solution in Python, but if your files are called file_20061105.nc etc., you could merge them with CDO (Climate Data Operators) from the command line and then use the runmean function:
cdo mergetime file_*.nc merged_file.nc
cdo runmean,15 merged_file.nc runmean.nc
On some systems there is a limit to the number of files you can have open, in which case you may need to merge the files one year at a time first
for year in {2007..2017} ; do
cdo mergetime file_${year}????.nc merged_${year}.nc
done
cdo mergetime merged_????.nc merged_file.nc
cdo runmean,15 merged_file.nc runmean.nc
Just as an alternative way to do this quickly from the command line.
If you want to do this task in a Python program, you can cat the files into a single one this way first (or loop over the files in Python and read them into a single numpy array of shape 4015 x 360 x 100), and then perform the running mean in Python. There is already a Stack Overflow question on this task here:
Moving average or running mean
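For reference, a rough sketch of that numpy route (the file pattern and the variable name 'temp' are assumptions, adjust them to the actual data):
import glob
import numpy as np
from netCDF4 import Dataset

files = sorted(glob.glob('file_*.nc'))
# stack all daily fields into one array of shape (ntime, 360, 100); masked values become NaN
data = np.stack([np.ma.filled(Dataset(f).variables['temp'][:], np.nan) for f in files])

# 15-day running mean at every grid point, ignoring NaN values
runmean = np.stack([np.nanmean(data[i:i + 15], axis=0)
                    for i in range(data.shape[0] - 14)])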
Upvotes: 2
Reputation: 3507
As to my comment above:
"How many items do you have in each file? ... If each file contains thousands of grid points, I would start by sorting the different grid points to separate files. Each file will hold the same grid point for all dates, sorted by date. This way it would be simple to load an entire file of a single grid point and calculate a running average on it."
Now that you have a file for a single grid point, I would load the data into a list and run this simple running average calculation. (Since you have access to the entire dataset, you can use this code. For cases where the average is calculated on the fly and there is no history of the results, you can use the algorithms described here: Wikipedia - Moving Average)
# Generate a list of 10 items
my_gridpoints_data = [x for x in range(1, 11)]
print(my_gridpoints_data)

# The averaging window is set to 3, so the average is over 3 items at a time
avg_window_width: int = 3
avg: float = 0.0
total: float = 0.0

# Calculate the average of the first 3 items (avg_window_width is 3)
for pos in range(0, avg_window_width):
    total = total + my_gridpoints_data[pos]
avg = total / avg_window_width
print(avg)

# Then move the window by subtracting the leftmost item
# and adding a new item from the right.
# Do this until the window reaches the list's last item.
for pos in range(avg_window_width, len(my_gridpoints_data)):
    total = total + my_gridpoints_data[pos] - my_gridpoints_data[pos - avg_window_width]
    avg = total / avg_window_width
    print(avg)
The result output is:
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
2.0
3.0
4.0
5.0
6.0
7.0
8.0
9.0
Upvotes: 1