Reputation: 3143
I have code organized in such a way that I have a struct of structs, and in my main method a number of functions that take a pointer to the main struct as an argument. I am wondering whether certain choices I made in this organization adversely affect the speed of my code. A minimal example for the sake of my question looks like this:
#include <math.h>
#include <stdlib.h>

#define NPMAX 50000
typedef struct Particles{
double *X, *Y, *Z;
} Particles;
typedef struct Properties{
int Npart;
double Box[3];
double minDist;
} Properties;
typedef struct System{
Properties props;
Particles parts;
} System;
void function(System *sys){
double dist;
int i;
for(i=0; i<sys->props.Npart; i++){
dist = pow(sys->parts.X[i],2.) + pow(sys->parts.Y[i],2.) + pow(sys->parts.Z[i],2.);
if(dist<sys->props.minDist) sys->props.minDist=dist;
}
return;
}
With the following main method:
int main(){
System sys;
sys.parts.X = (double *)malloc(sizeof(double) * NPMAX);
sys.parts.Y = (double *)malloc(sizeof(double) * NPMAX);
sys.parts.Z = (double *)malloc(sizeof(double) * NPMAX);
//... some code to populate sys.parts.X, Y, and Z ...
sys.props.Npart = 1000;
sys.props.Box[0] = 10.; //etc.
sys.props.minDist = 9999.;
function(&sys);
// some file I/O
return 0;
}
My question is, given this data structure, have I organized my function in the best possible way for efficiency? I mean speed-wise, not in terms of memory. More specifically:
Is accessing and assigning values to sys->parts.X[i] slower than creating a pointer directly to sys->parts within the function and doing parts->X[i], for instance?
Is having variables allocated both in heap and stack within the same struct a wise choice speed-wise? Is the program losing time trying to access these values in the memory because of this mix?
Should I expect this approach to be faster than just using a global variable for each individual variable declared within the structs?
I have access to the Intel compilers in addition to gcc, and I'm compiling with the -O3 flag.
Upvotes: 0
Views: 220
Reputation: 141768
Is accessing and assigning values to sys->parts.X[i] slower than creating a pointer directly to sys->parts within the function and doing parts->X[i], for instance?
From the compiler's point of view, only side effects are important. I think both cases should be optimized to the same instructions by a sane compiler with good optimization. Let's test it out:
void function(System *sys){
double dist;
int i;
for(i=0; i<sys->props.Npart; i++){
dist = pow(sys->parts.X[i],2.) + pow(sys->parts.Y[i],2.) + pow(sys->parts.Z[i],2.);
if(dist<sys->props.minDist) sys->props.minDist=dist;
}
return;
}
void function2(System *sys){
double dist;
int i;
for(i=0; i<sys->props.Npart; i++){
const struct Particles * const p = &sys->parts;
dist = pow(p->X[i],2.) + pow(p->Y[i],2.) + pow(p->Z[i],2.);
if(dist<sys->props.minDist) sys->props.minDist=dist;
}
return;
}
Both functions compile into the same assembly instructions, as shown at godbolt. Throughout this post I am using gcc 8.2 targeting 64-bit x86_64.
Is having variables allocated both in heap and stack within the same struct a wise choice speed-wise? Is the program losing time trying to access these values in the memory because of this mix?
The real answer is: it depends on the architecture. On x86_64 I believe there will be no measurable difference in accessing (not allocating) the array members between:
System sys_instance;
System *sys = &sys_instance;
double Xses[NPMAX];
sys->parts.X = Xses;
double Yses[NPMAX];
sys->parts.Y = Yses;
double Zses[NPMAX];
sys->parts.Z = Zses;
and:
System *sys = alloca(sizeof(*sys));
sys->parts.X = alloca(sizeof(*sys->parts.X) * NPMAX);
sys->parts.Y = alloca(sizeof(*sys->parts.Y) * NPMAX);
sys->parts.Z = alloca(sizeof(*sys->parts.Z) * NPMAX);
and:
System *sys = malloc(sizeof(*sys));
sys->parts.X = malloc(sizeof(*sys->parts.X) * NPMAX);
sys->parts.Y = malloc(sizeof(*sys->parts.Y) * NPMAX);
sys->parts.Z = malloc(sizeof(*sys->parts.Z) * NPMAX);
or any mix of these forms. Whether you use malloc or alloca, both result in a pointer, and from the accessing point of view that is the same. But keep CPU caches and other architecture-dependent effects in mind. Using malloc will result in significantly "slower" allocation.
Should I expect this approach to be faster than just using a global variable for each individual variable declared within the structs?
Even if you do:
static System sys_static;
System *sys = &sys_static;
static double X_static[NPMAX];
sys->parts.X = X_static;
static double Y_static[NPMAX];
sys->parts.Y = Y_static;
static double Z_static[NPMAX];
sys->parts.Z = Z_static;
a pointer to sys is still passed to your function function, and all accesses are the same.
In some rare cases (when not using malloc, when the initialization of sys has no side effects, when your function is declared static, and with a good optimizer) the whole thing could be optimized out and sys->props.minDist could be precomputed by the compiler at compile time. But I wouldn't aim for that, unless you want to use C++ with consteval or constexpr.
If the number of X, Y and Z elements is always the same, I would go with what @WhozCraig suggested and store one struct per particle instead of three parallel arrays.
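My reading of that suggestion, roughly (names kept from the question, Properties unchanged):
/* array of structs: one struct per particle instead of three parallel arrays */
typedef struct Particles{
    double X, Y, Z;
} Particles;

typedef struct System{
    Properties props;
    Particles *parts;   /* a single malloc of Npart elements */
} System;
With that layout the loop becomes: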
void function(System *sys){
double dist;
int i;
for(i=0; i<sys->props.Npart; i++){
const struct Particles * const p = &sys->parts[i];
dist = pow(p->X, 2.) + pow(p->Y, 2.) + pow(p->Z, 2.);
if(dist<sys->props.minDist) sys->props.minDist=dist;
}
return;
}
This will save the cycles needed for the address multiplication. It will also reduce the number of mallocs needed to allocate (and resize) the elements. The sys->parts[i] part can be calculated once for the whole dist= line. In the case of sys->parts.X[i], sys->parts may be calculated once, but then for each of X, Y and Z the value pointer + sizeof(elem) * i must be calculated. With a decent compiler and optimizer it makes no difference, though: this approach resulted in different assembly, but the same number of instructions, see godbolt.
I would definitely declare all variables that denote the size of an object with size_t type, i.e. the loop counter i should have size_t type and sys->props.Npart should also be size_t. They represent an element count, and that's what size_t is for.
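A minimal sketch of that change (only Properties shown; the loop counter i in function becomes size_t in the same way):
#include <stddef.h>   /* size_t */

typedef struct Properties{
    size_t Npart;     /* element count, hence size_t */
    double Box[3];
    double minDist;
} Properties;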
But I would definitely hand-optimize the loop. You are accessing sys->props.Npart in each loop check. If you stay with the pointers, I would declare double *X, *Y, *Z; as restrict with respect to each other; I suppose you don't expect them to alias. You are also accessing sys->props.minDist in each iteration and conditionally assigning to it. You only need to dereference sys twice here, at the beginning and at the end (unless you have some parallel code that depends on the minDist value in the middle of the calculation, which I hope you don't, because you have no means of synchronization in your current code). Use a local variable and access sys as few times as possible.
I would replace the pow calls with assignments to local variables (so that each value is dereferenced only once) and plain multiplication. Compilers may have to assume that a dereferenced value can change mid-loop if there are any assignments; let's protect against that. A good optimizer will, however, optimize the pow(..., 2.) calls anyway.
If performance is really needed, I would go with:
void function3(System * restrict sys){
double minDist = sys->props.minDist;
for (const struct Particles
* const start = &sys->parts[0],
* const stop = &sys->parts[sys->props.Npart],
* p = start; p < stop; ++p) {
const double X = p->X;
const double Y = p->Y;
const double Z = p->Z;
const double dist = X * X + Y * Y + Z * Z;
if (dist < minDist) {
minDist = dist;
}
}
sys->props.minDist = minDist;
return;
}
This results in a tiny bit less assembly code, mostly because sys->props.minDist is not accessed on every iteration and there is no need to keep and increment a separate counter. Use const to give the compiler hints that you won't modify a variable.
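For completeness, the same ideas applied to the question's original struct-of-arrays layout could look roughly like this sketch (the function name is mine; the restrict qualifiers encode my assumption that X, Y and Z never alias each other):
void function_soa(System * restrict sys){
    const double * restrict X = sys->parts.X;
    const double * restrict Y = sys->parts.Y;
    const double * restrict Z = sys->parts.Z;
    const size_t n = sys->props.Npart;     /* size_t, per the suggestion above */
    double minDist = sys->props.minDist;   /* local copy, written back once */
    size_t i;
    for (i = 0; i < n; i++) {
        const double x = X[i], y = Y[i], z = Z[i];
        const double dist = x * x + y * y + z * z;
        if (dist < minDist) minDist = dist;
    }
    sys->props.minDist = minDist;
}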
Upvotes: 1
Reputation: 560
The memory layout looks fine. With only a few allocations the structure doesn't matter that much. Those double arrays do offer a nice option for vector computing with a temporary array in between.
#include <immintrin.h>  // AVX-512 intrinsics; compile with -mavx512f

// collect computations first
double dist[NPMAX];
// process 8 64-bit floating-points at a time
int n = sys->props.Npart & ~7;
for(int i = 0; i < n; i += 8){
    __m512d x = _mm512_loadu_pd(&sys->parts.X[i]);
    __m512d y = _mm512_loadu_pd(&sys->parts.Y[i]);
    __m512d z = _mm512_loadu_pd(&sys->parts.Z[i]);
    // x*x + y*y + z*z, written to the temporary array
    __m512d sq = _mm512_add_pd(_mm512_mul_pd(x, x),
                 _mm512_add_pd(_mm512_mul_pd(y, y), _mm512_mul_pd(z, z)));
    _mm512_storeu_pd(&dist[i], sq);
}
// deal with remainders (if any)
for (int i = n; i < sys->props.Npart; i++)
    dist[i] = sys->parts.X[i] * sys->parts.X[i]
            + sys->parts.Y[i] * sys->parts.Y[i]
            + sys->parts.Z[i] * sys->parts.Z[i];
// then find lowest
for (int i = 0; i < sys->props.Npart; i++)
    if(dist[i] < sys->props.minDist) sys->props.minDist = dist[i];
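If you want to push this a bit further, the final minimum search can also use the vector min operation; a sketch under the same AVX-512 assumption:
// vectorized minimum over the temporary dist[] array
__m512d vmin = _mm512_set1_pd(sys->props.minDist);
for (int i = 0; i < n; i += 8)
    vmin = _mm512_min_pd(vmin, _mm512_loadu_pd(&dist[i]));
// reduce the 8 lanes to a single scalar minimum
double lanes[8];
_mm512_storeu_pd(lanes, vmin);
double best = lanes[0];
for (int k = 1; k < 8; k++)
    if (lanes[k] < best) best = lanes[k];
// scalar tail for the remainder
for (int i = n; i < sys->props.Npart; i++)
    if (dist[i] < best) best = dist[i];
sys->props.minDist = best;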
Upvotes: 1