Friday, May 22, 2015

Underestanding PowerVR GPUs via Metal

In my previous post I suggested that OpenGL and OpenGL ES, as APIs, don't always fit the underlying hardware. One way to understand this is to read GPU hardware documentation - AMD is pretty good about posting hardware specs, e.g. ISAs, register listings, etc. You can also read the extensions and see the IHV trying to bend the API to be closer to the hardware (see NVidia's big pile of bindless this and bindless that). But both these ways of "studying" the hardware are time consuming and not practical if you don't do 3-d graphics full time.

Recently there has been a flood of new low-level, close-to-the-hardware APIs: metal (Apple, PowerVR), Mantle (AMD, GCN), Vulkan (Khronos, everything), DirectX 12 (Microsoft, desktop GPUs). This provides us another way of understanding the hardware: we can look at what the graphics API would look like if it were rewritten to match today's hardware.

Let's take a look at some Metal APIs and see what they tell us about the PowerVR hardware.

Mutability Is Expensive

A texture in Metal is referenced via an MTLTexture object.* Note that while it has properties to get its dimensions, there is no API to change its size! Instead you have to fill in a new MTLTextureDescriptor and use that to make a brand new MTLTexture object.

In graphics terms, the texture is immutable. You can change the contents of its image, but you can't change the object itself in such a way that the underlying hardware resources and shader instructions associated with the texture have to be altered.

This is a win for the driver: when you go to use an MTLTexture, whatever was true about the texture last time you used it is still true now, always.

Compare this to OpenGL. With OpenGL, you can bind the texture id to a new texture - not only with different dimensions, but maybe of a totally different type. Surprise, OpenGL - that 2-d texture I used is now a cube map! Because anything can change at any time, OpenGL has to track mutations and re-check the validity of bound state when you draw.

Commands Are Assembled in Command Buffers and Then Queued for the GPU

How do your OpenGL commands actually get to the GPU? The OpenGL way involves a fair amount of witchcraft:
  1. You make an OpenGL context current to a thread.
  2. You issue function calls into the OpenGL API.
  3. "Later" stuff happens. If you never call glFlush, glFinish, or some kind of swap command, maybe some of your commands never execute.
That's definitely not how the hardware works. Again, Metal gives us a view of the underlying implementation.

On every modern GPU where I've been able to find out how command processing works, the GPU follows pretty much the same design:
  1. The driver fills in a command buffer - that is, a block of memory with GPU commands (typically a few bytes each) that tell the GPU what to do. The GPU commands don't match the source API - there will typically be commands for draw calls, setting up registers on the GPU, and that might be it.
  2. The driver queues completed command buffers for the GPU to run in some kind of order. The GPU might DMA the command buffer into its own space, or it might read it out of system memory.
Metal exposes this directly: MTLCommandBuffer represents a single command buffer, and MTLCommandQueue is where you queue it once you're done encoding it and you want the GPU to operate on it.

It turns out a fair amount of the CPU time the driver spends goes into converting your OpenGL commands into command buffers.  Metal exposes this too via specific MTLCommandEncoder subclasses. We can now see this work directly.

When you issue OpenGL commands, the encoder is built into the context, is "discovered" via your current thread, and commands are sent to a command buffer that is allocated on the fly. (If you really push the API hard, some OpenGL implementations can block in random locations because the context's encoder can't get a command buffer.)

The OpenGL context also has access to a queue internally, and will queue your buffer when (1) it fills up or (2) you call one of glFlush/glFinish/swap. This is why your commands might not start executing until you call flush - if the buffer isn't full, OpenGL will leave it around, waiting for more commands.

One last note: the race condition between the CPU writing commands and the GPU reading them is handled by a buffer being in only one place at a time, whether it's the CPU (encoding commands) or GPU (executing them) - this is true for both Metal and GLES. So while you are encoding commands, the GPU has not started on them yet.

Normally this is not a problem - you queue up a ton of work and the GPU always has a long todo list. But in the non-ideal case where GPU latency matters (e.g. you want the answer as fast as possible), in OpenGL ES you might have to issue a flush so the GPU can start - OpenGL will then get you a new command buffer. (This is why the GL spec has all of that language about glFlush ensuring that commands will complete in finite time - until you flush, the command buffer is just sitting there waiting for the driver to add more to it.)

The GPU Does Work When You Start and End Rendering to a Surface

As I am sure you have read 1000 times, the PowerVR GPUs are tiled deferred renderers. What this means is that rasterizing and fragment shading are done on tiny 32x32 pixel tiles of the screen, one at at time. (The tile size might be different - I haven't found a good reference.) For each rendering pass, the GPU iterates on each tile of the surface and renders everything in the rendering pass that intersects that tile.

The PowerVR GPUs are designed this way so that they can function without high-speed VRAM tied to a high-bandwidth memory bus. Normal desktop GPUs use a ton of memory bandwidth, and that's a source of power consumption.  The PowerVR GPUs have a tiny amount of on-chip video memory; for each tile the surface is loaded into this cache, fully shaded (with multiple primitives) and then saved back out to shared memory (e.g. the surface itself).**

This means the driver has to understand the bounded set of drawing operations that occur for a single surface, book-ended by a start and end. The driver also has to understand the life-cycle of this rendering pass: do we need to load the surface from memory to modify it, or can we just clear it and draw? What results actually need to be saved?  (You probably need your color buffer when you're done drawing, but maybe not the depth buffer. If depth was just used for hidden surface removal, you can skip saving it to memory.) Optimizing the start and end of a surface rendering pass saves a ton of bandwidth.

Metal lets you specify how a rendering pass will work explicitly: an MTLRenderPassDescriptor describes the surfaces you will render to and exactly how you want them to be loaded and stored. You can explicitly specify that the surface be loaded from memory, cleared, or whatever is fastest; you can also explicitly store the surface, use it for an FSAA resolve, or discard it.

To get a command encoder to render (a MTLRenderCommandEncoder), you have to pass a MTLRenderPassDescriptor describing how a pass is book-ended and what surfaces are involved. You can't not answer the question.

Compare this to OpenGL ES; when you bind a new surface for drawing, the driver must note that it doesn't know how you want the pass started. It then has to track any drawing operation (which will implicitly load the surface from memory) as well as a clear operation (which will start by clearing). Lots of book-keeping.

The Entire Pipeline Is Grafted Onto Your Shader

OpenGL encourages us to think of the format of our vertex data as being part of the vertex data, because we use glVertexAttribPointer to tell OpenGL how our vertices are read from a VBO.

This view of vertex fetching is misleading; glVertexAttribPointer really wraps up two very different bits of information:

  • Where to get the raw vertex data (we need to know the VBO binding and base pointer) and
  • How to fetch and interpret that data (for which we need to know the data type, stride, and whether normalization is desired).
The trend in recent years is for GPUs to do vertex fetch "in software" as part of the vertex shader, rather than have fixed function hardware or registers that do the fetch. Moving vertex fetch to software is a win because the hardware already has to support fast streamed cached reads for compute applications, so some fixed function transistors can be thrown overboard to make room for more shader cores.

On the desktop, blending is still fixed function, but on the Power VR, blending and write-out to the framebuffer is done in the shader as well.  (For a really good explanation of why blending hasn't gone programmable on the desktop, read this.  Since the currently rendered tile is cached on chip on the PowerVR, you can see why the arguments about latency and bandwidth from desktop don't apply here, making blend-in-shader a reasonable idea.)

The sum of these two facts is: your shader actually contains a bunch of extra code, generated by the driver, on both the front and back.

Metal exposes this directly with a single object: MTLRenderPipelineState. This object wraps up the actual complete GPU pipeline with all of the "extra" stuff included that you wouldn't know about in OpenGL. Like most GPU objects, the pipeline state is immutable and is created with a separate MTLRenderPipelineDescriptor object. We can see from the descriptor that the pipeline locks down not only the vertex and fragment functions, but also the vertex format and anti-aliasing properties for rasterization. Color mask and blending is in the color attachment descriptor, so that's part of the pipeline too.

Every time you change the vertex format (or even pretend to by changing the vertex base pointer with glVertexAttribPointer), every time you change the color write mask, or change blending, you're requiring a new underlying pipeline to be built for your GLSL shader. Metal exposes the actual pipeline, allowing for greater efficiency. (In X-Plane, for example, we always tie blending state to the shader, so a pipeline is a pretty good fit.)

If there's a summary here, it's that GLES doesn't quite match the PowerVR chip, and we can see the mismatch by looking at Metal. In almost all cases, the driver has to do more work to make GLES fit the hardware, inferring and guessing the semantics of our application.

I'll do one more post in this series, looking at Mantle, and some of the terrifying things we've never had to worry about when running OpenGL on AMD's GCN architecture.

* Technically the real API objects are all ObjC protocols, while the lighter-weight struct-like entities are objects. I'll call them all objects here - to client code they might as well be. The fact that API-created objects are protocols stops you from trying to alloc/init them.

** Besides saving bus bandwidth, this technique also saves shading ops. Because the renderer ha access to the entire rendering pass before it fills in a tile, it can re-order opaque triangles for perfect front-to-back rendering, leveraging early Z rejection.