Reputation: 1165
I currently have a performance bottleneck in my OpenGL ES program. I thought it would perform well: it uses VBOs, a texture atlas, few binds per draw call, and so on. But when many sprites are drawn at the same time, performance drops a lot. I found that the bottleneck is CPU-bound (which surprised me a bit). More precisely, the bottleneck can be traced to a method that calculates the screen position of each rectangle's four vertices: x1, y1, x2, y2, x3, y3, x4, y4. These positions are used for collision detection. The method repeats what the shaders do, and I think many of the CPU cycles are spent on the matrix-vector (MV) multiplications:
Matrix.multiplyMV(resultVec, 0, mModelMatrix, 0, rhsVec, 0);
Here rhsVec is a float array that holds a vertex as described above.
Since this seems to be the bottleneck, I wonder how I could access the same vector from the shader, for instance at the point where the clip coordinates are calculated? Clip coordinates would do, or even better, the coordinates produced by the shaders further down the pipeline.
The vertex shader:
uniform mat4 u_MVPMatrix;
uniform mat4 u_MVMatrix;
attribute vec4 a_Position;
attribute vec2 a_TexCoordinate;
varying vec2 v_TexCoordinate;
void main()
{
    v_TexCoordinate = a_TexCoordinate;
    gl_Position = u_MVPMatrix * a_Position;
}
Snippet of onSurfaceCreated:
final int vertexShaderHandle = ShaderHelper.compileShader(GLES20.GL_VERTEX_SHADER, vertexShader);
final int fragmentShaderHandle = ShaderHelper.compileShader(GLES20.GL_FRAGMENT_SHADER, fragmentShader);
mProgramHandle = ShaderHelper.createAndLinkProgram(vertexShaderHandle, fragmentShaderHandle,
new String[] {"a_Position", "a_Color", "a_Normal", "a_TexCoordinate"});
textureHandle = TextureHelper.loadTexture(context);
GLES20.glUseProgram(mProgramHandle);
mMVPMatrixHandle = GLES20.glGetUniformLocation(mProgramHandle, "u_MVPMatrix");
mMVMatrixHandle = GLES20.glGetUniformLocation(mProgramHandle, "u_MVMatrix");
//mColorHandle = GLES20.glGetAttribLocation(mProgramHandle, "a_Color");
mTextureCoordinateHandle = GLES20.glGetAttribLocation(mProgramHandle, "a_TexCoordinate");
mPositionHandle = GLES20.glGetAttribLocation(mProgramHandle, "a_Position");
The method that performs the vertex transformation (the bottleneck):
private void calcPos(int index) {
    int k = 0;
    for (int i = 0; i < 18; i += 3) {
        rhsVec[0] = vertices[0 + i];
        rhsVec[1] = vertices[1 + i];
        rhsVec[2] = vertices[2 + i];
        rhsVec[3] = 1;
        // *** Step 1 : Getting to eye coordinates ***
        Matrix.multiplyMV(resultVec, 0, mModelMatrix, 0, rhsVec, 0);
        // *** Step 2 : Getting to clip coordinates ***
        float[] rhsVec2 = resultVec;
        Matrix.multiplyMV(resultVec2, 0, mProjectionMatrix, 0, rhsVec2, 0);
        // *** Step 3 : Getting to normalized device coordinates ***
        float inv_w = 1 / resultVec2[3];
        for (int j = 0; j < resultVec2.length - 1; j++) {
            resultVec2[j] = inv_w * resultVec2[j];
        }
        float xPos = (resultVec2[0] * 0.5f + 0.5f) * game_width;
        float yPos = (resultVec2[1] * 0.5f + 0.5f) * game_height;
        float zPos = (1 + resultVec2[2]) * 0.5f;
        SpriteData sD = spriteDataArrayList.get(index);
        switch (k) {
            case 0:
                sD.xPos[0] = xPos;
                sD.yPos[0] = yPos;
                break;
            case 1:
                sD.xPos[2] = xPos;
                sD.yPos[2] = yPos;
                break;
            case 2:
                sD.xPos[3] = xPos;
                sD.yPos[3] = yPos;
                break;
            case 3:
                sD.xPos[1] = xPos;
                sD.yPos[1] = yPos;
                break;
        }
        k++;
        if (i == 3) {
            i += 9;
        }
    }
}
This method is called once per sprite, so for 100 sprites it is repeated 100 times. Is it the MV multiplications that hurt performance?
Upvotes: 0
Views: 258
Reputation: 2380
To answer the main question: I don't think it's possible to grab the transformed verts from the GPU, at least not with OpenGL ES 2.0, which has no transform feedback.
Here is a first pass at optimizing the loop. First off, don't do things over and over inside the loop when they always produce the same result; do them once outside the loop. This applies especially to function or property calls.
Next, you can multiply the two matrices together so that both transforms are applied, in order, with a single matrix-vector multiplication per vertex. That said, it looks like you then transform the final result back into screen space.
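A minimal sketch of the single-multiplication idea, assuming column-major 4x4 matrices and column vectors (the layout android.opengl.Matrix uses). The class CombinedTransform and its helpers are made up for illustration so the snippet runs without Android; in the real code you would call Matrix.multiplyMM once and Matrix.multiplyMV per vertex.

```java
import java.util.Arrays;

final class CombinedTransform {
    // Column-major 4x4 product: returns a * b.
    static float[] multiplyMM(float[] a, float[] b) {
        float[] out = new float[16];
        for (int col = 0; col < 4; col++) {
            for (int row = 0; row < 4; row++) {
                float sum = 0f;
                for (int k = 0; k < 4; k++) {
                    sum += a[k * 4 + row] * b[col * 4 + k];
                }
                out[col * 4 + row] = sum;
            }
        }
        return out;
    }

    // Column-major 4x4 matrix times a 4-component column vector.
    static float[] multiplyMV(float[] m, float[] v) {
        float[] out = new float[4];
        for (int row = 0; row < 4; row++) {
            out[row] = m[row] * v[0] + m[4 + row] * v[1]
                     + m[8 + row] * v[2] + m[12 + row] * v[3];
        }
        return out;
    }

    public static void main(String[] args) {
        // Example matrices only: a translation and a uniform scale.
        float[] model = {1, 0, 0, 0,  0, 1, 0, 0,  0, 0, 1, 0,  2, 3, 0, 1};
        float[] proj  = {0.5f, 0, 0, 0,  0, 0.5f, 0, 0,  0, 0, 0.5f, 0,  0, 0, 0, 1};
        float[] v = {1, 1, 0, 1};

        // Two multiplications per vertex: model first, then projection.
        float[] twoStep = multiplyMV(proj, multiplyMV(model, v));

        // One combined matrix (built once), then one multiplication per vertex.
        float[] mvp = multiplyMM(proj, model);
        float[] oneStep = multiplyMV(mvp, v);

        System.out.println(Arrays.toString(twoStep));
        System.out.println(Arrays.toString(oneStep));
    }
}
```

Note the order: with column vectors, projection * model applies the model transform first, which matches the order of the two multiplyMV calls in the question.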
You are copying data and then using the copy without changing it. I know the matrix multiplication expects 4 floats (a vec4), but you can write your own multiplication that reads the 3 floats in place and fills in the w component itself, which avoids the copy.
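Something along these lines could serve as the copy-free multiplication. This is only a sketch: multiplyMV3 is not part of android.opengl.Matrix, the class name PointTransform is invented here, and the matrix is assumed to be column-major with an implicit w = 1 for every vertex.

```java
final class PointTransform {
    // Writes m * (x, y, z, 1) into out[outOffset .. outOffset + 3],
    // reading x, y, z straight from src[srcOffset .. srcOffset + 2].
    // m is a column-major 4x4 matrix stored in 16 floats.
    static void multiplyMV3(float[] out, int outOffset,
                            float[] m, float[] src, int srcOffset) {
        float x = src[srcOffset];
        float y = src[srcOffset + 1];
        float z = src[srcOffset + 2];
        out[outOffset]     = m[0] * x + m[4] * y + m[8]  * z + m[12];
        out[outOffset + 1] = m[1] * x + m[5] * y + m[9]  * z + m[13];
        out[outOffset + 2] = m[2] * x + m[6] * y + m[10] * z + m[14];
        out[outOffset + 3] = m[3] * x + m[7] * y + m[11] * z + m[15];
    }

    public static void main(String[] args) {
        // Example data only: a translation by (2, 3, 0) and two packed vertices.
        float[] translate = {1, 0, 0, 0,  0, 1, 0, 0,  0, 0, 1, 0,  2, 3, 0, 1};
        float[] vertices = {9, 9, 9,  1, 1, 0};
        float[] result = new float[4];
        multiplyMV3(result, 0, translate, vertices, 3); // second vertex
        System.out.println(result[0] + ", " + result[1] + ", " + result[2] + ", " + result[3]);
    }
}
```

Because w is hard-coded to 1, the last column of the matrix is simply added, which also saves four multiplications per vertex.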
Avoid calculations that you ultimately don't use.
Cache results and don't recalculate unless they change.
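The caching advice could look roughly like the sketch below. Everything in it is hypothetical (the class CachedCorners, its fields, and the recomputeCount counter are not from your code), and the corner calculation is a plain axis-aligned stand-in for the real matrix transform. The point is only the dirty flag: the corners are recomputed when the sprite has actually moved, not on every frame.

```java
final class CachedCorners {
    private float x, y;                            // sprite centre
    private final float halfW, halfH;              // half extents
    private final float[] corners = new float[8];  // x1, y1, ..., x4, y4
    private boolean dirty = true;                  // true when the cache is stale
    int recomputeCount = 0;                        // only here to make the caching visible

    CachedCorners(float x, float y, float halfW, float halfH) {
        this.x = x;
        this.y = y;
        this.halfW = halfW;
        this.halfH = halfH;
    }

    void setPosition(float newX, float newY) {
        if (newX == x && newY == y) {
            return; // nothing changed, keep the cached corners
        }
        x = newX;
        y = newY;
        dirty = true;
    }

    float[] get() {
        if (dirty) {
            // Stand-in for the real per-vertex transform.
            corners[0] = x - halfW; corners[1] = y - halfH;
            corners[2] = x + halfW; corners[3] = y - halfH;
            corners[4] = x + halfW; corners[5] = y + halfH;
            corners[6] = x - halfW; corners[7] = y + halfH;
            recomputeCount++;
            dirty = false;
        }
        return corners;
    }

    public static void main(String[] args) {
        CachedCorners sprite = new CachedCorners(10f, 20f, 2f, 3f);
        sprite.get();
        sprite.get(); // served from the cache
        sprite.setPosition(11f, 20f);
        sprite.get(); // recomputed because the sprite moved
        System.out.println("recomputed " + sprite.recomputeCount + " times");
    }
}
```

If most sprites stand still between frames, this removes most of the per-frame transforms; if everything moves every frame, it buys nothing.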
private void calcPos(int index) {
    // Get only once, not every loop.
    SpriteData sD = spriteDataArrayList.get(index);
    // The 4 verts you want (verify against your vertex layout).
    // Ideally a static final field, so it is not reallocated on every call.
    int[] vIndices = {0, 1, 2, 5};
    // Multiply once outside the loop, use the result inside the loop.
    // mvpMatrix is assumed to be a preallocated float[16] field.
    // With column vectors, projection * model applies the model transform first.
    Matrix.multiplyMM(mvpMatrix, 0, mProjectionMatrix, 0, mModelMatrix, 0);
    for (int i = 0; i < 4; ++i) { // only grab verts you want, no need for fancy skips
        int nVert = 3 * vIndices[i]; // 3 floats per vert
        // Should avoid copying data when you aren't going to change the copy.
        rhsVec[0] = vertices[0 + nVert];
        rhsVec[1] = vertices[1 + nVert];
        rhsVec[2] = vertices[2 + nVert];
        rhsVec[3] = 1;
        // To avoid the copy, write your own multiplyMV3 (it is not part of
        // android.opengl.Matrix) that reads 3 floats at an offset and fills in w = 1.
        // E.g.:
        // multiplyMV3(resultVec2, 0, mvpMatrix, 0, vertices, nVert);
        // Do both matrix multiplications at the same time.
        Matrix.multiplyMV(resultVec2, 0, mvpMatrix, 0, rhsVec, 0);
        // *** Step 3 : Getting to normalized device coordinates ***
        float inv_w = 1 / resultVec2[3];
        for (int j = 0; j < 2; ++j) { // just what we need (x and y)
            resultVec2[j] *= inv_w;
        }
        // Curious... Transform into projection space, just to transform
        // back into screen space. Perhaps you are transforming too far?
        float xPos = (resultVec2[0] * 0.5f + 0.5f) * game_width;
        float yPos = (resultVec2[1] * 0.5f + 0.5f) * game_height;
        // float zPos = (1 + resultVec2[2]) * 0.5f; // not used
        switch (i) {
            case 0:
                sD.xPos[0] = xPos;
                sD.yPos[0] = yPos;
                break;
            case 1:
                sD.xPos[2] = xPos;
                sD.yPos[2] = yPos;
                break;
            case 2:
                sD.xPos[3] = xPos;
                sD.yPos[3] = yPos;
                break;
            case 3:
                sD.xPos[1] = xPos;
                sD.yPos[1] = yPos;
                break;
        }
    }
}
Upvotes: 1