Reputation: 9735
In my program I have a big 2-D grid (1000 x 1000, or more); each cell contains a small piece of information. To represent this concept the choice is quite natural: a matrix data structure.
The corresponding code (in C++) is something like:
int w_size_grid = 1000;
int h_size_grid = 1000;
int* matrix = new int[w_size_grid * h_size_grid];
As you can notice, I've actually used a flat one-dimensional array, but the principle is the same.
In order to access an element of the grid, we need a function that, given a cell identified by (x, y), returns the value stored in that cell.
Mathematically:
f(x, y) -> Z
obviously:
f: Z^2 -> Z
where Z is the set of integer numbers.
That can be trivially achieved with a linear function. Here is a C++ representation:
int get_value(int x, int y) {
    return matrix[y * w_size_grid + x];
}
Actually, the design concept requires a sort of "circular-continuous grid": the access indices for a cell may go beyond the limits of the grid itself.
That means, for example, that the particular call get_value(-1, -1);
is still valid. The function will just return the same value as get_value(w_size_grid - 1, h_size_grid - 1);.
Actually this is not a problem in the implementation:
int get_value(int x, int y) {
    adjust_xy(&x, &y); // modify x and y in accordance with that rule.
    return matrix[y * w_size_grid + x];
}
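For completeness, one way adjust_xy could be implemented (a sketch; the post only declares it) is with a double modulo, which compensates for the sign of C++'s remainder on negative operands:

```cpp
const int w_size_grid = 1000;
const int h_size_grid = 1000;

// Wrap x and y into [0, size): -1 maps to size - 1, size maps to 0.
// The "+ size" before the second % is needed because the C++ %
// operator yields a negative remainder for negative operands.
void adjust_xy(int* x, int* y) {
    *x = ((*x % w_size_grid) + w_size_grid) % w_size_grid;
    *y = ((*y % h_size_grid) + h_size_grid) % h_size_grid;
}
```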
Anyway, this is just an additional note to make the scenario clearer.
The problem presented above is very simple to design and to implement.
My problem comes from the fact that the matrix is updated at a high frequency: each cell in the matrix is read and possibly updated with a new value.
Obviously the matrix is traversed with two loops, according to a cache-friendly design:
for (int y = 0; y < h_size_grid; ++y) {
    for (int x = 0; x < w_size_grid; ++x) {
        int value = get_value(x, y);
    }
}
The inner loop iterates over x, since [x-1], [x], [x+1] are stored contiguously. Indeed, that loop exploits the principle of locality.
The problem arises now because the updated value of a cell depends on the values in the adjacent cells.
Each cell has exactly eight neighbours, which are the cells that are horizontally, vertically, or diagonally adjacent.
(-1,-1) | (0,-1) | (1,-1)
-------------------------
(-1, 0) | (0, 0) | (1, 0)
-------------------------
(-1, 1) | (0, 1) | (1, 1)
So the code is intuitively:
for (int y = 0; y < h_size_grid; ++y) {
    for (int x = 0; x < w_size_grid; ++x) {
        int value = get_value(x, y);
        auto values = get_value_all_neighbours(x, y); // the 8 neighbouring integers
    }
}
The function get_value_all_neighbours(x, y)
has to access one row above and one row below the current y.
Since a row of the matrix is quite big, this involves cache misses and pollutes the caches themselves.
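For concreteness, a minimal self-contained sketch of what get_value_all_neighbours could look like (the helper bodies are assumptions; the post does not show them):

```cpp
#include <array>

const int w_size_grid = 1000;
const int h_size_grid = 1000;
int* matrix = new int[w_size_grid * h_size_grid]();  // zero-initialised

// Assumed wrap-around helper (not shown in the original post).
void adjust_xy(int* x, int* y) {
    *x = ((*x % w_size_grid) + w_size_grid) % w_size_grid;
    *y = ((*y % h_size_grid) + h_size_grid) % h_size_grid;
}

int get_value(int x, int y) {
    adjust_xy(&x, &y);
    return matrix[y * w_size_grid + x];
}

// Gather the eight neighbours row by row. The six accesses on rows
// y - 1 and y + 1 touch memory ~4 KB away and are therefore the
// likely cache misses.
std::array<int, 8> get_value_all_neighbours(int x, int y) {
    std::array<int, 8> values;
    int i = 0;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx)
            if (dx != 0 || dy != 0)
                values[i++] = get_value(x + dx, y + dy);
    return values;
}
```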
Now that I have finally presented the scenario and the problem, my question is how to "solve" it.
Using some additional data structure, or reorganizing the data, is there a way to exploit the caches and avoid all those misses?
My feeling guides me toward a strategic data structure.
I've thought about changing the order in which the values are stored in the array, trying to store at contiguous indices those cells which are neighbours.
That implies a no-longer-linear function for get_value.
After some thinking, I believe it is not possible to find such a non-linear function.
I've also thought about an additional data structure, like a hash table storing the adjacent values of each cell, but I think that would be overkill in space and maybe in CPU cycles as well.
Upvotes: 4
Views: 1235
Reputation: 1899
Let's assume you do indeed have a problem with cache misses that can't easily be avoided (referring to the other answers here).
You could use a space-filling curve to organize your data in a cache-friendly way. Essentially, space-filling curves map a volume or plane (such as your matrix) to a linear representation, such that values that are close together in space end up (mostly) close together in the linear representation. In effect, if you store the matrix in a z-ordered array, neighbouring elements have a high likelihood of being on the same memory page.
The best proximity mapping is given by the Hilbert curve, but it is expensive to calculate. A better trade-off may be a z-curve (Morton order): it provides good proximity and is fast to calculate.
Z-curve: Essentially, to get the ordering, you interleave the bits of your x and y coordinates into a single value, called the 'z-value'. This z-value determines the position in your list of matrix values (or even simply the index in an array, if you use an array). The z-values are consecutive for a completely filled matrix (every cell is used, with power-of-two dimensions). Conversely, you can de-interleave a position in the list (= array index) and get your x/y coordinates back.
Interleaving the bits of two values is quite fast; there are even dedicated CPU instructions that do this in a few cycles. If you can't find these (I can't, at the moment), you can simply use some bit-twiddling tricks.
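For reference, the classic "binary magic numbers" bit-twiddling version of the interleave for 16-bit coordinates might look like this (a sketch, not taken from the answer):

```cpp
#include <cstdint>

// Spread the lower 16 bits of v apart, inserting a zero bit between
// each pair: ...dcba -> ...0d0c0b0a.
uint32_t part1by1(uint32_t v) {
    v &= 0x0000FFFF;
    v = (v | (v << 8)) & 0x00FF00FF;
    v = (v | (v << 4)) & 0x0F0F0F0F;
    v = (v | (v << 2)) & 0x33333333;
    v = (v | (v << 1)) & 0x55555555;
    return v;
}

// Undo part1by1: collect every other bit back into the low half.
uint32_t compact1by1(uint32_t v) {
    v &= 0x55555555;
    v = (v | (v >> 1)) & 0x33333333;
    v = (v | (v >> 2)) & 0x0F0F0F0F;
    v = (v | (v >> 4)) & 0x00FF00FF;
    v = (v | (v >> 8)) & 0x0000FFFF;
    return v;
}

// z-value: bits of x in the even positions, bits of y in the odd ones.
uint32_t morton_encode(uint32_t x, uint32_t y) {
    return part1by1(x) | (part1by1(y) << 1);
}

// Recover the original coordinates from a z-value.
void morton_decode(uint32_t z, uint32_t* x, uint32_t* y) {
    *x = compact1by1(z);
    *y = compact1by1(z >> 1);
}
```

With this layout, the cell (x, y) lives at index morton_encode(x, y) instead of y * w_size_grid + x.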
Upvotes: 2
Reputation: 57749
Actually, the data structure is not trivial, especially when optimization is a concern.
There are two main issues to resolve: data content and data usage. Data content is the values in the data; data usage is how the data is stored and retrieved, and how often.
Are all the values accessed? Frequently?
Data that is not accessed frequently can be pushed to slower media, including files. Leave the fast memory (such as data caches) for the frequently accessed data.
Is the data similar? Are there patterns?
There are alternative methods for representing matrices where a lot of the data is the same (such as a sparse matrix or a lower triangular matrix). For large matrices, maybe performing some checks and returning constant values may be faster or more efficient.
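For instance, a minimal sparse-grid sketch (the names and layout are hypothetical, not from the answer), where only the non-default cells are stored and every other read returns a constant:

```cpp
#include <cstdint>
#include <unordered_map>

// Only cells that differ from `default_value` occupy memory; any
// other (x, y) reads back as the constant without being stored.
struct SparseGrid {
    std::unordered_map<uint64_t, int> cells;
    int default_value = 0;

    // Pack the two 32-bit coordinates into one 64-bit hash key.
    static uint64_t key(int x, int y) {
        return (static_cast<uint64_t>(static_cast<uint32_t>(y)) << 32)
             | static_cast<uint32_t>(x);
    }

    void set(int x, int y, int v) { cells[key(x, y)] = v; }

    int get(int x, int y) const {
        auto it = cells.find(key(x, y));
        return it == cells.end() ? default_value : it->second;
    }
};
```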
Data usage is a key factor in determining an efficient structure for the data. Even with matrices.
For example, for frequently accessed data, a map or associative array may be faster.
Sometimes, using many local variables (i.e. registers) may be more efficient for processing matrix data. For example, load up registers with values first (data fetches), operate using the registers, then store the registers back into memory. For most processors, registers are the fastest media for holding data.
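As an illustration of that pattern (a sketch, not from the answer), here is a 1-D stencil that keeps its three live values in locals, so each row element is read from memory exactly once per pass:

```cpp
// Sum of the left and right neighbours for each interior cell of a
// row. The three live values stay in locals (likely registers): one
// fresh load per iteration, then the window slides.
void row_neighbour_sum(const int* row, int* out, int n) {
    if (n < 3) return;
    int left = row[0];
    int mid  = row[1];
    for (int x = 1; x + 1 < n; ++x) {
        int right = row[x + 1];  // the only memory read this iteration
        out[x] = left + right;   // compute purely from locals
        left = mid;              // slide the window one cell right
        mid  = right;
    }
}
```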
The data may need to be rearranged to make efficient use of data caches and cache lines. The data cache is a high-speed area of memory very close to the processor core. A cache line is one row of data in the data cache. An efficient matrix layout fits one or more rows per cache line.
The most efficient method is to perform as many accesses to a data cache line as possible, and to reduce the need to reload the data cache (because an index was out of range).
Can the operations be performed independently?
For example, scaling a matrix, where each location is multiplied by a value: these operations don't depend on other cells of the matrix, which allows them to be performed in parallel. If they can be performed in parallel, they can be delegated to processors with multiple cores (such as GPUs).
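A minimal sketch of such an independent per-cell operation (the two-thread split is an illustrative assumption, not part of the answer):

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Scale every cell by `factor`. No cell depends on any other, so the
// matrix can be split into disjoint halves processed concurrently.
void scale_matrix(std::vector<int>& m, int factor) {
    auto worker = [&m, factor](std::size_t begin, std::size_t end) {
        for (std::size_t i = begin; i < end; ++i)
            m[i] *= factor;
    };
    std::size_t mid = m.size() / 2;
    std::thread other(worker, std::size_t{0}, mid);  // first half
    worker(mid, m.size());                           // second half here
    other.join();
}
```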
When a program is data driven, the choice of data structures is not trivial. The content and usage are important factors when choosing a structure for the data and how the data is aligned. Usage and performance requirements will also determine the best algorithms for accessing the data. There are already many articles on the internet for optimizing for data driven applications and best usage of data caches.
Upvotes: 1