Reputation: 37481
I've been writing an experimental app in which I render 100 chunks of 16x16x16 cubes using VBOs. I'm doing this because a dozen people have praised VBOs beyond measure and told me that they perform much better than the per-chunk display lists I'm using in my actual Minecraft-style game.
It's been a painful process, trying to adapt many poorly written tutorials that only focus on a single cube/triangle into something that can handle the amount of drawing I need. I'm still not at all convinced that VBOs are better for my game than display lists.
For the most part, I've finally tweaked the code so that my interleaved VBO data is built only once (when the chunk loads), and then on each render call the buffer ID is bound and glDrawArrays is called.
I'm slowly increasing the number of blocks/chunks in this experimental app to see how performance holds up. The actual game will have to handle 16x16x128 blocks in every chunk, with at most 20x20 chunks loaded. Roughly 60% of those will be solid blocks that are rendered, so maybe 8 million blocks. That renders without much issue using the display list method I started with.
However, even though I have VBO render performance within tolerable levels now, I can't generate a radius of 10 chunks without hitting a memory limit:
Exception in thread "main" java.lang.OutOfMemoryError: Direct buffer memory
at java.nio.Bits.reserveMemory(Bits.java:658)
at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306)
at org.lwjgl.BufferUtils.createByteBuffer(BufferUtils.java:60)
at org.lwjgl.BufferUtils.createFloatBuffer(BufferUtils.java:110)
at com.helion3.opengl.rendering.TextureQuadRenderer.<init>(TextureQuadRenderer.java:25)
at com.helion3.opengl.shapes.Chunk.<init>(Chunk.java:13)
at com.helion3.opengl.shapes.World.<init>(World.java:18)
at com.helion3.opengl.Game.start(Game.java:90)
at com.helion3.opengl.Launcher.main(Launcher.java:19)
I'm pretty confident that my buffers are set up with the right counts. I call BufferUtils.createFloatBuffer( using 192 floats per cube (3 position, 3 color, and 2 texture-coordinate floats per vertex, times 4 vertices per face and 6 faces) multiplied by 4096, the number of blocks in the test chunk.
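For scale, here is a rough sketch of what those counts add up to in bytes. The class and method names are made up for illustration; only the numbers (8 floats per vertex, 24 vertices per cube, 4096 blocks per chunk, 100 chunks) come from the description above. It also prints the JVM's maximum heap size, because on HotSpot the direct-buffer limit is controlled by -XX:MaxDirectMemorySize and, as far as I know, defaults to roughly that value.

```java
// Back-of-the-envelope direct-memory estimate for the buffer layout described above.
// All names are hypothetical; only the counts come from the post.
final class ChunkBufferBudget {
    static final int FLOATS_PER_VERTEX = 3 + 3 + 2;   // position + color + texture coords
    static final int VERTICES_PER_CUBE = 6 * 4;       // six faces, four vertices each
    static final int BLOCKS_PER_CHUNK = 16 * 16 * 16; // 4096 blocks in the test chunk

    // Floats needed for one fully described cube.
    static long floatsPerCube() {
        return (long) FLOATS_PER_VERTEX * VERTICES_PER_CUBE;
    }

    // Bytes of direct memory for one chunk when every block gets a full cube.
    static long bytesPerChunk() {
        return floatsPerCube() * BLOCKS_PER_CHUNK * Float.BYTES;
    }

    // Bytes of direct memory if one such buffer is kept alive per chunk.
    static long bytesForChunks(int chunkCount) {
        return bytesPerChunk() * chunkCount;
    }

    public static void main(String[] args) {
        System.out.println("floats per cube:      " + floatsPerCube());
        System.out.println("bytes per chunk:      " + bytesPerChunk());
        System.out.println("bytes for 100 chunks: " + bytesForChunks(100));
        System.out.println("JVM max heap (bytes): " + Runtime.getRuntime().maxMemory());
    }
}
```

That is 3 MiB per chunk and 300 MiB for 100 chunks, assuming every per-chunk FloatBuffer stays allocated at the same time.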
Now in the real game, I'm not rendering the block faces that aren't exposed to air - but even if I do that in this test application, I'm still only rendering 16x16x16 blocks.
How can I better manage the VBO memory? My VBO test app rendering code, chunk code
At which point do VBOs shine the way everyone's been selling them? lol
P.S. I guess I'll dive into instancing now, see how that helps.
Upvotes: 0
Views: 929
Reputation: 162317
Now in the real game, I'm not rendering the block faces that aren't exposed to air - but even if I do that in this test application, I'm still only rendering 16x16x16 blocks.
That's very good. I mention it because a common mistake in Minecraft-style renderers is sending OpenGL all of the blocks (worse still, rendering them all). Instead, you should determine which surfaces are actually visible and keep only those in a VBO. A spatial subdivision structure helps here. A Minecraft-like world just reeks of octrees, which can trivially store the actual thing. (This is for other people coming across this Q&A.)
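As a minimal sketch of the "only keep visible surfaces" idea, the following counts the faces of a chunk that border air. It uses a plain boolean array rather than an octree, treats everything outside the chunk as air, and all names are hypothetical; a real renderer would emit vertex data for each exposed face instead of just counting.

```java
// Counts block faces that touch air, i.e. the only faces worth putting in a VBO.
// Assumption: anything outside the chunk bounds is treated as air.
final class ExposedFaces {
    // Offsets to the six neighbours: -X, +X, -Y, +Y, -Z, +Z.
    static final int[][] NEIGHBOURS = {
        {-1, 0, 0}, {1, 0, 0}, {0, -1, 0}, {0, 1, 0}, {0, 0, -1}, {0, 0, 1}
    };

    // True if (x, y, z) is inside the chunk and holds a solid block.
    static boolean isSolid(boolean[][][] solid, int x, int y, int z) {
        return x >= 0 && x < solid.length
            && y >= 0 && y < solid[0].length
            && z >= 0 && z < solid[0][0].length
            && solid[x][y][z];
    }

    // Number of faces of solid blocks whose neighbour on that side is not solid.
    static int countExposedFaces(boolean[][][] solid) {
        int faces = 0;
        for (int x = 0; x < solid.length; x++) {
            for (int y = 0; y < solid[0].length; y++) {
                for (int z = 0; z < solid[0][0].length; z++) {
                    if (!solid[x][y][z]) continue;
                    for (int[] n : NEIGHBOURS) {
                        if (!isSolid(solid, x + n[0], y + n[1], z + n[2])) faces++;
                    }
                }
            }
        }
        return faces;
    }
}
```

For a completely solid 16x16x16 chunk this leaves 6 × 16 × 16 = 1536 faces out of the 6 × 4096 = 24576 that a naive "every face of every block" buffer would hold.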
Another thing to keep in mind: since everything is rendered as cubes, you don't have to keep around millions of identical cubes that differ only by translation. A single one suffices if you use instancing.
With instancing you need only 4 integers per block to fully describe it (3 for the position, 1 for the surface, which you load from a GL_TEXTURE_2D_ARRAY). Integers are preferable because they come in smaller varieties (the only drawback is that older GPUs can't handle them efficiently). Say one cube represents 1m³. Then a 16 bit integer gives you a world of (65536m)³. You'll also hardly need more than 256 kinds of surface for a Minecraft-style game, so use an 8 bit integer to represent that. And when you keep only the blocks that have visible surfaces, a lot of your volume doesn't have to reside in the VBO at all.
This can be a real memory saver. Going from 32 bit floats to 16 bit integers saves you 50% of the memory. Using a single 8 bit integer for a material index instead of a 3× 32 bit float color cuts your memory demands for that to 1/12th.
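A minimal sketch of such a per-instance record, using only java.nio: three unsigned 16 bit coordinates plus an 8 bit material index. The names and the extra padding byte (to round each record up to 8 bytes) are my assumptions, not something the layout above requires. In OpenGL 3.3 terms, a buffer like this would typically be fed to the shader as integer attributes via glVertexAttribIPointer with glVertexAttribDivisor set to 1.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Packs per-block instance data as 3 x unsigned 16-bit position + 8-bit material index.
// Assumption: one padding byte rounds each record up to 8 bytes for alignment.
final class BlockInstancePacker {
    static final int BYTES_PER_INSTANCE = 3 * Short.BYTES + 1 + 1; // 8 bytes

    // Each entry of blocks is {x, y, z, material}, with x, y, z in 0..65535 and material in 0..255.
    static ByteBuffer pack(int[][] blocks) {
        ByteBuffer buf = ByteBuffer
            .allocateDirect(blocks.length * BYTES_PER_INSTANCE)
            .order(ByteOrder.nativeOrder());
        for (int[] b : blocks) {
            buf.putShort((short) b[0]); // x, stored as unsigned 16 bit
            buf.putShort((short) b[1]); // y
            buf.putShort((short) b[2]); // z
            buf.put((byte) b[3]);       // material index, stored as unsigned 8 bit
            buf.put((byte) 0);          // padding
        }
        buf.flip(); // ready to be read or uploaded
        return buf;
    }
}
```

At 8 bytes per block, this compares with 192 floats × 4 bytes = 768 bytes for a cube stored as full interleaved float vertices.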
To further decrease the rendering load, you can exploit the fact that all cubes in the world are parallel. This makes it embarrassingly easy to eliminate hidden surfaces before they are even processed: there are 8 + 6 major directions from which to look at a cube, 8 where 3 faces sharing a corner are visible and 6 where only the face you're looking at directly is visible. It's rather easy to determine the principal planes in the world where each case applies from the current point of view. So you have 14 variations of the cube base template and make instancing calls into each subvolume for a certain case. The octree again helps you select which instances get which variant.
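One way to sketch the face selection is a per-axis comparison between the camera position and the cube: a face can only be seen if the camera lies on its outer side. The class, constants, and the convention that a cube occupies [x, x+1] on each axis are assumptions for illustration. When the camera is level with the cube along an axis, neither face on that axis is selected.

```java
// Computes which faces of an axis-aligned unit cube can face the camera.
// Assumption: the cube at integer position (x, y, z) occupies [x, x+1] on each axis.
final class FaceVisibility {
    static final int NEG_X = 1, POS_X = 2, NEG_Y = 4, POS_Y = 8, NEG_Z = 16, POS_Z = 32;

    // Returns a bit mask of the faces whose outer side contains the camera.
    static int visibleFaceMask(double camX, double camY, double camZ, int x, int y, int z) {
        int mask = 0;
        if (camX < x) mask |= NEG_X; else if (camX > x + 1) mask |= POS_X;
        if (camY < y) mask |= NEG_Y; else if (camY > y + 1) mask |= POS_Y;
        if (camZ < z) mask |= NEG_Z; else if (camZ > z + 1) mask |= POS_Z;
        return mask;
    }
}
```

Because all cubes are parallel, every block in a subvolume that lies entirely on one side of the camera along each axis gets the same mask, so the template can be chosen once per subvolume rather than once per block.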
Another thing you should think about is cache coherency. How your cubes' data is arranged and aligned in the VBO, and the order in which you access it, is a huge deal. In general you want your data to be nicely aligned and coalesced (although with current multipath memory architectures separated data layouts can offer quite good performance as well). Your access pattern, however, should not "jump all over the place", so to speak. Keep your accesses nicely grouped. This is the major reason display lists can outperform VBOs: the contents are constant, so the driver can rearrange them into an optimally aligned and ordered structure. You'll have to experiment with this.
To give you an idea about what current state of the art game engines do: A recent report from DICE stated that in the upcoming "Battlefield 4" game a single frame of even the most complex scenes is generated by no more than about 2000 drawing API calls. That's a really low number.
Upvotes: 3