Improving performance when looping in a big data set

Question

I am doing some spatio-temporal analysis (with MATLAB) on a fairly large data set, and I am not sure which strategy is best for my script in terms of performance.

The data set is split into 10 yearly arrays of dimension (latitude, longitude, time) = (50, 60, 8760).

The general structure of my analysis is:

 for iteration = 1:bigNumber

   % 1. Select a specific site of spatial reference (i,j).
   % 2. Do some calculation on the whole time series of site (i,j).
   % 3. Store the result in an archive array.

 end

My question is:

Is it better (in terms of general performance) to have

1) all data in big yearly (50,60,8760) arrays, loaded once as global variables. At each iteration, the script has to extract one particular "site" (i,j,:) from those arrays for processing.

2) 50*60 distinct files stored in a folder, each containing the time series of one site (a vector of dimension (total time range,1)). At each iteration, the script has to open a specific file from the folder, process its data, and then close it.
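
To make option 1 concrete, here is a minimal sketch with reduced sizes and random data. The names `yearly`, `allData`, and `archive` are hypothetical, and `mean` only stands in for the real calculation. At full size, assuming double precision, the ten yearly arrays hold 50*60*8760*10 values, which is roughly 2.1 GB.

```matlab
% Sketch of option 1 with reduced, random data (all names are hypothetical).
nLat = 5; nLon = 6; nTime = 24; nYears = 3;   % real case: 50, 60, 8760, 10

% One (lat, lon, time) array per year, held in a cell array.
yearly = cell(1, nYears);
for y = 1:nYears
    yearly{y} = rand(nLat, nLon, nTime);
end

% Concatenate once along the time dimension, outside the main loop.
allData = cat(3, yearly{:});

archive = zeros(nLat, nLon);                  % preallocated result array
for i = 1:nLat
    for j = 1:nLon
        ts = squeeze(allData(i, j, :));       % time series of site (i,j) as a column vector
        archive(i, j) = mean(ts);             % placeholder for the real calculation
    end
end
```

Concatenating once up front means each iteration only indexes into an array that is already in memory, with no file access inside the loop.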

user3272910 · Accepted Answer

After some experiments, it is clear that the second option, with 3000 distinct files, is much slower than working with big arrays loaded in the workspace. I did not try loading all 3000 files into the workspace before computing, though (a tad too much).
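
A comparison of this kind could be set up roughly as follows. This is a minimal sketch, not the code behind the experiment above: the sizes are reduced, the data is random, the file naming `site_i_j.mat` is an assumption, and `mean` stands in for the real calculation.

```matlab
% Rough timing sketch (hypothetical names, reduced sizes, random data).
nLat = 5; nLon = 6; nTime = 240;
data = rand(nLat, nLon, nTime);

% Option 2 setup: write one small file per site into a temporary folder.
folder = tempname;
mkdir(folder);
for i = 1:nLat
    for j = 1:nLon
        ts = squeeze(data(i, j, :));
        save(fullfile(folder, sprintf('site_%d_%d.mat', i, j)), 'ts');
    end
end

% Option 1: index into the array that is already in memory.
tic;
resMem = zeros(nLat, nLon);
for i = 1:nLat
    for j = 1:nLon
        resMem(i, j) = mean(squeeze(data(i, j, :)));
    end
end
tMem = toc;

% Option 2: load the file of each site inside the loop.
tic;
resFile = zeros(nLat, nLon);
for i = 1:nLat
    for j = 1:nLon
        s = load(fullfile(folder, sprintf('site_%d_%d.mat', i, j)));
        resFile(i, j) = mean(s.ts);
    end
end
tFile = toc;

fprintf('in memory: %.4f s, from files: %.4f s\n', tMem, tFile);
rmdir(folder, 's');   % clean up the temporary files
```

Timings from a toy case like this will not match the full data set, so the loop is best rerun on realistic sizes before drawing conclusions.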

It looks like reshaping the data helps a little bit.
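
Which reshaping was used is not stated here. One plausible reading, given that MATLAB stores arrays in column-major order, is to move time to the first dimension so that each site's series sits contiguously in memory. The sketch below assumes that interpretation; the sizes are reduced and all variable names are hypothetical.

```matlab
% Sketch: move time to the first dimension so each site's series is contiguous.
nLat = 5; nLon = 6; nTime = 24;               % real case: 50, 60, 8760
data = rand(nLat, nLon, nTime);               % (lat, lon, time), random stand-in

% MATLAB stores arrays column-major, so data(i,j,:) strides through memory.
% After permuting to (time, lat, lon), dataT(:,i,j) is one contiguous column.
dataT = permute(data, [3 1 2]);

i = 2; j = 3;
tsOld = squeeze(data(i, j, :));               % strided access, needs squeeze
tsNew = dataT(:, i, j);                       % contiguous access, already a column

% Alternative: flatten the sites into the columns of a 2-D (time, site) matrix.
data2D = reshape(dataT, nTime, nLat * nLon);
k = sub2ind([nLat, nLon], i, j);              % linear index of site (i,j)
tsFlat = data2D(:, k);
```

The `permute` call costs one full copy of the array, so it only pays off when done once before the loop rather than inside it.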

Thanks to all contributors for your suggestions.
