Staffan

Reputation: 1754

Tuning OpenGL performance for geometry throughput

This has probably been asked over and over but I couldn't find anything useful so here it goes again...

In my application I need to render a fairly large mesh (a couple of million triangles or more) and I'm having some problems getting decent frame rates out of it. The CPU is pretty much idling so I'm definitely GPU-bound. Changing the resolution doesn't affect performance, so it's not fragment- or raster-bound.

The mesh is dynamic (but locally static), so I cannot store the whole thing on the video card and render it with one call. For application-specific reasons the data is stored as an octree with voxels in the leaves, which means I get frustum culling basically for free. The vertex data consists of coordinates, normals and colors; no textures or shaders are used.

My first approach was to just render everything from memory using one big STREAM_DRAW VBO, which turned out to be too slow. My initial thought was that I was perhaps overtaxing the bus (pushing ~150 MiB per frame), so I implemented a caching scheme that keeps geometry recently used to render the object in static VBOs on the graphics card, with each VBO holding anywhere from a few hundred KiB to a couple of MiB of data (storing more per VBO causes more cache thrashing, so there's a trade-off here). The picture below is an example of what the data looks like, where everything colored red is drawn from cached VBOs.
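To put the ~150 MiB per frame figure in context, here is a rough back-of-the-envelope sketch in C. The vertex layout (3 floats for position, 3 floats for the normal, 4 bytes of RGBA color) and non-indexed triangles are assumptions; the question does not state either, and the function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMED vertex layout (not stated in the question):
 * 3 floats position + 3 floats normal + 4 bytes RGBA color. */
enum { ASSUMED_BYTES_PER_VERTEX = 3 * 4 + 3 * 4 + 4 }; /* 28 bytes */

/* Bytes streamed per frame for non-indexed triangles (3 vertices each). */
uint64_t stream_bytes_per_frame(uint64_t triangles, uint64_t bytes_per_vertex)
{
    return triangles * 3u * bytes_per_vertex;
}

/* Sustained transfer rate in MiB/s needed to stream that much data
 * at the given frame rate. */
double stream_mib_per_second(uint64_t bytes_per_frame, double frames_per_second)
{
    assert(frames_per_second >= 0.0);
    return (double)bytes_per_frame / (1024.0 * 1024.0) * frames_per_second;
}
```

Under these assumptions, 2 million triangles come to 168,000,000 bytes (about 160 MiB) per frame, the same order of magnitude as the figure quoted above, and streaming 150 MiB per frame at 30 frames per second would need 4500 MiB/s of sustained transfer.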

[Image: example of the rendered data (source: sourceforge.net)]

As the numbers below show, I don't see a spectacular increase in performance when using the cache. For a fully static mesh of about 1 million triangles I get the following frame rates:

So my question is: how do I speed this up? I.e.:

I'm not interested in answers suggesting LOD (I already tested this), vendor-specific tips or using OpenGL features from anything later than 1.5.

Upvotes: 15

Views: 2975

Answers (3)

rioki

Reputation: 6118

I don't know about your "mesh", but it looks like it consists entirely of cubes. If that is possible for you, compile a single unit cube into a display list and render a scaled version of that display list for each cube. That often gives a 10x speedup, since the bus is no longer flooded with vertex data and video memory is not exhausted.

Of course, that depends on your ability to change the data. It might not apply if the real data does not look like the picture.
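As a rough sketch of the per-cube transform this would need (the helper names and the center/edge-length representation of a voxel are assumptions, not something from the question), the matrix below maps a unit cube centered at the origin onto one voxel. It uses the column-major layout that glMultMatrixf expects.

```c
#include <assert.h>

/* Fill a column-major 4x4 matrix that maps a unit cube centered at the
 * origin onto a voxel with the given center and edge length. This is the
 * same transform as glTranslatef(cx, cy, cz) followed by
 * glScalef(edge, edge, edge). */
void voxel_matrix(float m[16], float cx, float cy, float cz, float edge)
{
    assert(edge > 0.0f);
    for (int i = 0; i < 16; ++i)
        m[i] = 0.0f;
    m[0] = m[5] = m[10] = edge;          /* uniform scale on the diagonal */
    m[12] = cx; m[13] = cy; m[14] = cz;  /* translation in the last column */
    m[15] = 1.0f;
}

/* Apply the matrix to a point with w = 1, as the fixed-function
 * pipeline would for a vertex position. */
void transform_point(const float m[16], const float in[3], float out[3])
{
    for (int r = 0; r < 3; ++r)
        out[r] = m[r] * in[0] + m[4 + r] * in[1] + m[8 + r] * in[2] + m[12 + r];
}
```

In the draw loop, each visible voxel would then be something like glPushMatrix, glMultMatrixf(m), glCallList(cube_list), glPopMatrix, where cube_list is a hypothetical display list built once with glNewList/glEndList. Note that scaling the modelview matrix also changes the length of transformed normals, so lighting would need GL_NORMALIZE (or GL_RESCALE_NORMAL, available since OpenGL 1.2) enabled.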

Upvotes: 0

basszero

Reputation: 30014

You're probably not going to like this response....

I've found your problem: Intel GM965 with open source Linux drivers

While my current job does not hit your volume of data, we've rendered several million vertices in VBOs, and Intel graphics hardware/drivers have proven useless. Get yourself an NVidia card (and get over having to use the binary driver, it just works) and you'll be all set. It doesn't even have to be current generation, though a top-end Quadro (if work is paying) or a top-end GTX 400 series (if you're paying or just trying to save some bucks at work) should do just fine w/ the latest drivers. You could also try to find a machine w/ this hardware to test on if upgrading your machine is not an option.

Upvotes: 5

elmattic

Reputation: 12174

I would use a performance profiler first (like gDEBugger), so you can figure out if you are vertex, fragment or bus limited, etc. It's hard to guess what optimizations to perform in such a particular case (Intel + open source drivers).

Did you also try plain vertex arrays (VA mode)? Are you using glDrawElements or glDrawArrays? Is the data vertex-cache friendly (pre- and post-transform)?
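One way to get a feel for post-transform cache friendliness without a profiler is to replay the index buffer through a simulated cache and compute the average cache miss ratio (ACMR, vertices transformed per triangle). The sketch below assumes a simple FIFO cache; real hardware may use a different size and replacement policy, so treat the numbers as a relative indicator only. The function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CACHE 64

/* Count how many indices miss a simulated FIFO post-transform cache of
 * `cache_size` entries while walking the index buffer in draw order. */
size_t count_cache_misses(const unsigned *indices, size_t count, size_t cache_size)
{
    unsigned fifo[MAX_CACHE];
    size_t used = 0, next = 0, misses = 0;

    assert(cache_size > 0 && cache_size <= MAX_CACHE);
    for (size_t i = 0; i < count; ++i) {
        int hit = 0;
        for (size_t j = 0; j < used; ++j) {
            if (fifo[j] == indices[i]) { hit = 1; break; }
        }
        if (hit)
            continue;
        ++misses;
        fifo[next] = indices[i];          /* overwrite the oldest entry */
        next = (next + 1) % cache_size;
        if (used < cache_size)
            ++used;
    }
    return misses;
}

/* Average cache miss ratio: simulated vertex transforms per triangle.
 * 3.0 means no reuse at all; lower is better. */
double acmr(const unsigned *indices, size_t index_count, size_t cache_size)
{
    assert(index_count >= 3 && index_count % 3 == 0);
    return (double)count_cache_misses(indices, index_count, cache_size)
         / (double)(index_count / 3);
}
```

For two triangles sharing an edge, indexed as {0, 1, 2, 2, 1, 3}, a 16-entry cache gives an ACMR of 2.0, while the same geometry sent as six unique vertices gives 3.0.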

Upvotes: 0
