Ian

Reputation: 3709

Architecture for Sprite Rendering in OpenGL

I have a lot of sprites to render, and wanted to get any feedback from folks who have pushed performance in this area.

I sort by shader and texture, and keep batches of sprites with the same render settings in VBOs to send to the shaders for rendering. All normal stuff. My sprites are all square and all have the same basic data: central position (P), orientation (O), scale (S), RGB color (Col) and global opacity (Alpha). I have to update position and orientation in CPU code (though about 50% of sprites don't change between any given pair of frames). Scale, color and opacity almost never change for a sprite, though not strictly never.

I can't assume geometry shaders (I will support them, but the question is moot in that case).

Should I:

  1. When I update the sprite positions, calculate the vertex positions on the CPU, making the vertex shader a simple transform step. (Advantage: significantly less data to update each frame, but the CPU has to do a lot of trig.)

  2. Put the P, O and S data into the VBO as additional data, duplicated for the 4 verts, then have the vert position just be simple offsets (-1,-1; -1,1; 1,1; 1,-1) and do the trig in the shader. (Advantage: the GPU does more of the calculation, but each vertex has 5 extra words of data.)

  3. It isn't obvious which is better, so both approaches need profiling to see what happens.
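
To make the trade-off concrete, here is a minimal CPU-side sketch of the per-corner math both options share. Option 1 would run it on the CPU for every moved sprite; option 2 would do the same arithmetic in the vertex shader. The names (`Sprite`, `cornerPosition`, `expandSprite`) are hypothetical, and it assumes orientation is in radians (counter-clockwise) and scale is the sprite's half-width.

```cpp
#include <array>
#include <cmath>
#include <cstddef>

struct Vec2 { float x, y; };

// Per-sprite data from the question: P, O and S.
struct Sprite {
    float x, y;        // centre position (P)
    float orientation; // radians, counter-clockwise (assumed convention)
    float scale;       // half-width of the square (assumed convention)
};

// Corner offsets in the order listed in option 2.
constexpr std::array<Vec2, 4> kCornerOffsets = {{
    {-1.0f, -1.0f}, {-1.0f, 1.0f}, {1.0f, 1.0f}, {1.0f, -1.0f}
}};

// Rotate the unit offset by the orientation, scale it, then translate
// it to the sprite centre.
inline Vec2 cornerPosition(const Sprite& s, std::size_t corner) {
    const Vec2 o = kCornerOffsets[corner];
    const float c = std::cos(s.orientation);
    const float sn = std::sin(s.orientation);
    return { s.x + s.scale * (o.x * c - o.y * sn),
             s.y + s.scale * (o.x * sn + o.y * c) };
}

// Option 1 style: expand one sprite into its four final vertex positions.
inline std::array<Vec2, 4> expandSprite(const Sprite& s) {
    std::array<Vec2, 4> out{};
    for (std::size_t i = 0; i < 4; ++i) {
        out[i] = cornerPosition(s, i);
    }
    return out;
}
```

With option 2 the CPU only copies P, O and S into the four vertices, and the body of `cornerPosition` moves into the shader.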

Obviously I can do 3, but I thought it would be useful to ask this question to see if I'm just lacking a gestalt about what should be faster. And either way the answer can help other serious sprite/particle implementers later.

Upvotes: 4

Views: 2131

Answers (3)

kravemir

Reputation: 10986

I've found a small bandwidth improvement. Assuming each of your sprites has 4 vertices (6 indices), you can simply use gl_VertexID % 4 instead of flags.

Per vertex attributes:

  • vec2 float position, float orientation, float scale - sprite geometry data (16B)
  • uint flags - optional flags for special sprites (4B)
  • float param - optional param for smooth sprite transformations (4B)

Uniforms:

  • vec2 vertexPosition[4] - relative position of each corner - you can use it to specify center
  • vec2 textureCoord[4] - texture coord for each corner, you can also use 4*n texcoords for n sprite states that can be defined via flags

This setup uses only 16B per vertex for simple sprites.
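
As a rough illustration of this layout, the sketch below shows a 16-byte vertex, the `% 4` corner lookup emulated on the CPU, and one possible index buffer with 6 indices per 4-vertex sprite. The names (`SpriteVertex`, `cornerFromVertexId`, `buildQuadIndices`) and the triangle winding are assumptions, not taken from the answer. It relies on each sprite's four vertices starting at a multiple of four in the buffer, since for indexed draws gl_VertexID is the index value.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// 16 bytes per vertex: position (vec2), orientation, scale.
struct SpriteVertex {
    float px, py;
    float orientation;
    float scale;
};
static_assert(sizeof(SpriteVertex) == 16, "simple sprite vertex should be 16 bytes");

// CPU stand-in for `gl_VertexID % 4` in the vertex shader: selects which
// entry of the vertexPosition[4] / textureCoord[4] uniforms to use.
inline unsigned cornerFromVertexId(unsigned vertexId) {
    return vertexId % 4u;
}

// Two triangles per sprite: (0,1,2) and (0,2,3) relative to the sprite's
// first vertex. The winding order here is an assumption.
inline std::vector<std::uint32_t> buildQuadIndices(std::uint32_t spriteCount) {
    std::vector<std::uint32_t> indices;
    indices.reserve(static_cast<std::size_t>(spriteCount) * 6);
    for (std::uint32_t s = 0; s < spriteCount; ++s) {
        const std::uint32_t base = s * 4;
        const std::uint32_t quad[6] = {base, base + 1, base + 2,
                                       base, base + 2, base + 3};
        indices.insert(indices.end(), quad, quad + 6);
    }
    return indices;
}
```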

Upvotes: 1

kroneml

Reputation: 687

From my experience with large numbers of particles, I would use option (2.). Maybe you can pack the index of the offset/direction into your data (e.g. as the w-component of your position vector, if you don't use it so far)? 0 = (-1,-1); 1 = (-1,1); 2 = (1,1); 3 = (1,-1).
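
A small sketch of that packing, written as CPU-side C++ for illustration; in practice the decode step would live in the vertex shader. `packVertex` and `cornerOffsetFromW` are hypothetical names, and storing the index as a float in w is an assumption about the vertex format.

```cpp
struct Vec2 { float x, y; };
struct Vec4 { float x, y, z, w; };

// Store the corner index (0..3) in the otherwise unused w-component.
inline Vec4 packVertex(float x, float y, float z, int corner) {
    return { x, y, z, static_cast<float>(corner) };
}

// Decode w back into an offset using the mapping
// 0 = (-1,-1); 1 = (-1,1); 2 = (1,1); 3 = (1,-1).
inline Vec2 cornerOffsetFromW(float w) {
    const int i = static_cast<int>(w + 0.5f); // round to the nearest index
    const float ox = (i >= 2) ? 1.0f : -1.0f;
    const float oy = (i == 1 || i == 2) ? 1.0f : -1.0f;
    return { ox, oy };
}
```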

(As suggested by Ian, I just copied my comment to an answer!)

@Ian: If I understand you correctly, you said that you have a global opacity/alpha, so you should be able to use a uniform for that and use the w-component of your vec4 color for the flag. However, I doubt that this will make any difference...

By the way, the geometry shader solution you already mentioned should not only be more elegant, but also a bit faster.

Upvotes: 2

Ian

Reputation: 3709

So I did (3) and profiled and, as kroneml said, option 2 won convincingly.

The best performing structure was two VBOs:

  1. vec2 float pos, float orientation, float scale (16 bytes/vertex)
  2. vec2 float tex, vec4 ubyte color, uint flags (16 bytes/vertex)

Where the flags encode the corner of the sprite, so we have 0x00000001 for right, and 0x00000002 for bottom. This allows the code to update sprite location to walk through the first VBO and set the values four at a time without any trig or other logic. All the math happens in the vertex shader.
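
For anyone wanting to try the same thing, here is a rough sketch of the two vertex layouts and the per-sprite update as described above. The struct and function names are made up for illustration, and the sign convention in `cornerSigns` (y up, so "bottom" means -1) is an assumption; the real decode happens in the vertex shader.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// VBO 1: updated most frames.
struct DynamicVertex {
    float x, y;
    float orientation;
    float scale;
};

// VBO 2: rarely updated.
struct StaticVertex {
    float u, v;
    std::uint8_t rgba[4];
    std::uint32_t flags;
};

static_assert(sizeof(DynamicVertex) == 16, "expected 16 bytes/vertex");
static_assert(sizeof(StaticVertex) == 16, "expected 16 bytes/vertex");

constexpr std::uint32_t kRight  = 0x00000001u;
constexpr std::uint32_t kBottom = 0x00000002u;

// Write the same values to all four vertices of one sprite.
// No trig or branching on the CPU side.
inline void updateSprite(std::vector<DynamicVertex>& vbo, std::size_t sprite,
                         float x, float y, float orientation, float scale) {
    const DynamicVertex v{x, y, orientation, scale};
    for (std::size_t i = 0; i < 4; ++i) {
        vbo[sprite * 4 + i] = v;
    }
}

struct Vec2 { float x, y; };

// CPU stand-in for the shader's flag decode (assumes y points up).
inline Vec2 cornerSigns(std::uint32_t flags) {
    return { (flags & kRight)  ? 1.0f : -1.0f,
             (flags & kBottom) ? -1.0f : 1.0f };
}
```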

In my tests merging the two VBOs into one performed better if the number of position updates was not much different to the number of texture/color updates. I assume this is because then the vertices are 32-byte aligned. But in my application (and I assume most people's), position is updated most frames, but other things never are, and having a smaller buffer to push down to the graphics card seemed to win.

Upvotes: 3
