Yesterday I pushed an implementation of a new OpenGL extension, GL_EXT_shader_samples_identical, to Mesa. This extension will be in the Mesa 11.1 release in a few short weeks, and it will be enabled on various Intel platforms:
- GEN7 (Ivy Bridge, Baytrail, Haswell): Only currently effective in the fragment shader. More details below.
- GEN8 (Broadwell, Cherry Trail, Braswell): Only currently effective in the vertex shader and fragment shader. More details below.
- GEN9 (Skylake): Only currently effective in the vertex shader and fragment shader. More details below.
The extension hasn't yet been published in the official OpenGL extension registry, but I will take care of that before Mesa 11.1 is released.
Background
Multisample anti-aliasing (MSAA) is a well known technique for reducing aliasing effects ("jaggies") in rendered images. The core idea is that the expensive part of generating a single pixel color happens once. The cheaper part of determining where that color exists in the pixel happens multiple times. For 2x MSAA this happens twice, for 4x MSAA this happens four times, etc. The computation cost is not increased by much, but the storage and memory bandwidth costs are increased linearly.
Some time ago, a clever person noticed that in areas where the whole pixel is covered by the triangle, all of the samples have exactly the same value. Furthermore, since most triangles are (much) bigger than a pixel, this is really common. From there it is trivial to apply some sort of simple data compression to the sample data, and all modern GPUs do this in some form. In addition to the surface that stores the data, there is a multisample control surface (MCS) that describes the compression.
On Intel GPUs, sample data is stored in n separate planes. For 4x MSAA, there are four planes. The MCS has a small table for each pixel that maps a sample to a plane. If the entry for sample 2 in the MCS is 0, then the data for sample 2 is stored in plane 0. The GPU automatically uses this to reduce bandwidth usage. When writing a pixel on the interior of a polygon (where all the samples have the same value), the MCS gets all zeros written, and the sample value is written only to plane 0.
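To make the sample-to-plane table concrete, here is a minimal CPU-side sketch in C. It is only a model: the 4-bit field per sample, the bit order, and the names mcs_plane_for_sample and fetch_sample are assumptions made for illustration, not the hardware's documented encoding.

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMPTION: each sample owns one 4-bit field in the MCS value, with
 * sample 0 in the lowest bits. Real hardware may pack this differently. */
#define MCS_BITS_PER_SAMPLE 4u

/* Look up which plane holds the data for the given sample. */
static unsigned mcs_plane_for_sample(uint64_t mcs, unsigned sample,
                                     unsigned num_samples)
{
    assert(num_samples <= 16 && sample < num_samples);
    return (unsigned)((mcs >> (sample * MCS_BITS_PER_SAMPLE)) & 0xfu);
}

/* Model of a multisample read: consult the MCS, then read that plane.
 * planes[p] stands in for this pixel's value in plane p. */
static float fetch_sample(const float *planes, uint64_t mcs,
                          unsigned sample, unsigned num_samples)
{
    return planes[mcs_plane_for_sample(mcs, sample, num_samples)];
}
```

With an MCS value of 0, every sample resolves to plane 0, which is the interior-of-a-polygon case described above. Only pixels on triangle edges need data in the other planes.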
This does add some complexity to the shader compiler. When a shader executes the texelFetch function, several things happen behind the scenes. First, an instruction is issued to read the MCS. Then a second instruction is executed to read the sample data. This second instruction uses the sample index and the result of the MCS read as inputs.
A simple shader like
#version 150
uniform sampler2DMS tex;
uniform int samplePos;
in vec2 coord;
out vec4 frag_color;
void main()
{
frag_color = texelFetch(tex, ivec2(coord), samplePos);
}
generates this assembly
pln(16) g8<1>F g7<0,1,0>F g2<8,8,1>F { align1 1H compacted };
pln(16) g10<1>F g7.4<0,1,0>F g2<8,8,1>F { align1 1H compacted };
mov(16) g12<1>F g6<0,1,0>F { align1 1H compacted };
mov(16) g16<1>D g8<8,8,1>F { align1 1H compacted };
mov(16) g18<1>D g10<8,8,1>F { align1 1H compacted };
send(16) g2<1>UW g16<8,8,1>F
sampler ld_mcs SIMD16 Surface = 1 Sampler = 0 mlen 4 rlen 8 { align1 1H };
mov(16) g14<1>F g2<8,8,1>F { align1 1H compacted };
send(16) g120<1>UW g12<8,8,1>F
sampler ld2dms SIMD16 Surface = 1 Sampler = 0 mlen 8 rlen 8 { align1 1H };
sendc(16) null<1>UW g120<8,8,1>F
render RT write SIMD16 LastRT Surface = 0 mlen 8 rlen 0 { align1 1H EOT };
The ld_mcs instruction is the read from the MCS, and the ld2dms is the read from the multisample surface using the MCS data. If a shader reads multiple samples from the same location, the compiler will likely eliminate all but one of the ld_mcs instructions.
Modern GPUs also have an additional optimization. When an application clears a surface, some values are much more commonly used than others. Permutations of 0s and 1s are, by far, the most common. Bandwidth usage can further be reduced by taking advantage of this. With a single bit for each of red, green, blue, and alpha, only four bits are necessary to describe a clear color that contains only 0s and 1s. A special value could then be stored in the MCS for each sample that uses the fast-clear color. A clear operation that uses a fast-clear compatible color only has to modify the MCS.
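The four-bit idea can be sketched in C as follows. This is an illustration only: the function name encode_fast_clear and the bit order (bit 0 for red through bit 3 for alpha) are assumptions, and real hardware may encode the clear color differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Try to describe a clear color with one bit per channel.
 * ASSUMPTION: bit 0 = red, bit 1 = green, bit 2 = blue, bit 3 = alpha.
 * Returns false, leaving *code untouched, if any channel is not
 * exactly 0.0 or 1.0. */
static bool encode_fast_clear(const float rgba[4], uint8_t *code)
{
    assert(rgba && code);

    uint8_t bits = 0;
    for (int i = 0; i < 4; i++) {
        if (rgba[i] == 1.0f)
            bits |= (uint8_t)(1u << i);
        else if (rgba[i] != 0.0f)
            return false;   /* not a permutation of 0s and 1s */
    }
    *code = bits;
    return true;
}
```

Opaque black, (0, 0, 0, 1), would encode as 0x8 under this bit order, while a mid-grey clear color would be rejected and have to take the slower path of writing real sample data.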
All of this is well documented in the Programmer's Reference Manuals for Intel GPUs.
There's More
Information from the MCS can also help users of the multisample surface reduce memory bandwidth usage. Imagine a simple, straightforward shader that performs an MSAA resolve operation:
#version 150
uniform sampler2DMS tex;
#define NUM_SAMPLES 4 // generate a different shader for each sample count
in vec2 coord;
out vec4 frag_color;
void main()
{
vec4 color = texelFetch(tex, ivec2(coord), 0);
for (int i = 1; i < NUM_SAMPLES; i++)
color += texelFetch(tex, ivec2(coord), i);
frag_color = color / float(NUM_SAMPLES);
}
The problem should be obvious. On most pixels all of the samples will have the same color, but the shader still reads every sample. It's tempting to think the compiler should be able to fix this. In very simple cases like this one, that may be possible, but such an optimization would be both challenging to implement and, likely, very easy to fool.
A better approach is to just make the data available to the shader, and that is where this extension comes in. A new function, textureSamplesIdenticalEXT, is added that allows the shader to detect the common case where all the samples have the same value. The new, optimized shader would be:
#version 150
#extension GL_EXT_shader_samples_identical: enable
uniform sampler2DMS tex;
#define NUM_SAMPLES 4 // generate a different shader for each sample count
#if !defined GL_EXT_shader_samples_identical
#define textureSamplesIdenticalEXT(t, c) false
#endif
in vec2 coord;
out vec4 frag_color;
void main()
{
vec4 color = texelFetch(tex, ivec2(coord), 0);
if (!textureSamplesIdenticalEXT(tex, ivec2(coord))) {
for (int i = 1; i < NUM_SAMPLES; i++)
color += texelFetch(tex, ivec2(coord), i);
color /= float(NUM_SAMPLES);
}
frag_color = color;
}
The intention is that this function be implemented by simply examining the MCS data. At least on Intel GPUs, if the MCS for a pixel is all 0s, then all the samples are the same. Since textureSamplesIdenticalEXT can reuse the MCS data read by the first texelFetch call, there are no extra reads from memory. There is just a single compare and conditional branch. These added instructions can be scheduled while waiting for the ld2dms instruction to read from memory (slow), so they are practically free.
It is also tempting to use textureSamplesIdenticalEXT in conjunction with anyInvocationsARB (from GL_ARB_shader_group_vote). Such a shader might look like:
#version 430
#extension GL_EXT_shader_samples_identical: require
#extension GL_ARB_shader_group_vote: require
uniform sampler2DMS tex;
#define NUM_SAMPLES 4 // generate a different shader for each sample count
in vec2 coord;
out vec4 frag_color;
void main()
{
vec4 color = texelFetch(tex, ivec2(coord), 0);
if (anyInvocationsARB(!textureSamplesIdenticalEXT(tex, ivec2(coord)))) {
for (int i = 1; i < NUM_SAMPLES; i++)
color += texelFetch(tex, ivec2(coord), i);
color /= float(NUM_SAMPLES);
}
frag_color = color;
}
Whether or not using anyInvocationsARB improves performance is likely to be dependent on both the shader and the underlying GPU hardware. Currently Mesa does not support GL_ARB_shader_group_vote, so I don't have any data one way or the other.
Caveats
The implementation of this extension that will ship with Mesa 11.1 has three main caveats. Each of these will likely be resolved to some extent in future releases.
The extension is only effective on scalar shader units. This means on GEN7 it is effective in fragment shaders. On GEN8 and GEN9 it is only effective in vertex shaders and fragment shaders. It is supported in all shader stages, but in non-scalar stages textureSamplesIdenticalEXT always returns false.
The implementation for the non-scalar stages is slightly different, and, on
GEN9, the exact set of instructions depends on the number of samples. I
didn't think it was likely that people would want to use this feature in a
vertex shader or geometry shader, so I just didn't finish the implementation.
This will almost certainly be resolved in Mesa 11.2.
The current implementation also returns a false negative for texels fully set to the fast-clear color. There are two problems with the fast-clear color. It uses a different value than the "all plane 0" case, and the size of the value depends on the number of samples. For 2x MSAA, the MCS read returns 0x000000ff, but for 8x MSAA it returns 0xffffffff.
The first problem means the compiler would need to generate additional instructions to check for "all plane 0" or "all fast-clear color." This could hurt the performance of applications that either don't use a fast-clear color or, more likely, that later draw non-clear data to the entire surface. The second problem means the compiler would need to do state-based recompiles when the number of samples changes.
In the end, we decided that "all plane 0" was by far the most common case, so we have ignored the "all fast-clear color" case for the time being. We are still collecting data from applications, and we're working on several uses of this functionality inside our driver. In future versions we may implement a heuristic to determine whether or not to check for the fast-clear color case.
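The trade-off can be sketched in C. This is a model rather than the driver's code: the function names are hypothetical, and the fast-clear value is assumed to be all ones across 4 bits per sample, which is consistent with the 0x000000ff (2x) and 0xffffffff (8x) values quoted above but is not otherwise confirmed here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* ASSUMPTION: the "all fast-clear color" MCS value is all ones across
 * 4 bits per sample, so its width depends on the sample count. */
static uint64_t mcs_fast_clear_value(unsigned num_samples)
{
    assert(num_samples == 2 || num_samples == 4 ||
           num_samples == 8 || num_samples == 16);
    return num_samples == 16 ? UINT64_MAX
                             : (UINT64_C(1) << (4 * num_samples)) - 1;
}

/* The check as described for Mesa 11.1: one compare against zero,
 * independent of the sample count. */
static bool samples_identical_plane0_only(uint64_t mcs)
{
    return mcs == 0;
}

/* Hypothetical extended check: a second compare whose constant changes
 * with the sample count, hence the state-based recompiles. */
static bool samples_identical_with_fast_clear(uint64_t mcs,
                                              unsigned num_samples)
{
    return mcs == 0 || mcs == mcs_fast_clear_value(num_samples);
}
```

The plane-0-only version needs no knowledge of the sample count, which is why it avoids recompiles. The cost is a false negative for a fully fast-cleared texel such as 0xff on a 2x surface.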
As mentioned above, Mesa does not currently support GL_ARB_shader_group_vote. Applications that want to use textureSamplesIdenticalEXT on Mesa will need paths that do not use anyInvocationsARB for at least the time being.
Future
As stated by issue #3, the extension still needs to gain SPIR-V support. This extension would be just as useful in Vulkan and OpenCL as it is in OpenGL.
At some point there is likely to be a follow-on extension that provides more MCS data to the shader in a more raw form. As stated in issue #2 and previously in this post, there are a few problems with providing raw MCS data. The biggest problem is how the data is returned. Each sample count needs a different amount of data. Current 8x MSAA MCS surfaces have 32 bits (returned) per pixel. Current 16x MSAA MCS surfaces have 64 bits per pixel. Future 32x MSAA, should that ever exist, would need 192 bits. Additionally, there would need to be a set of new texelFetch functions that take both a sample index and the MCS data. This, again, has problems with variable data size.
Applications would also want to query things about the MCS values. How many times is plane 0 used? Which samples use plane 2? What is the highest plane used? There could be other useful queries. I can imagine that a high quality, high performance multisample resolve filter could want all of this information. Since the data changes based on the sample count and could change on future hardware, the future extension really should not directly expose the encoding of the MCS data. How should it provide the data? I'm expecting to write some demo applications and experiment with a bunch of different things. Obviously, this is an open area of research.
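To give a feel for the kind of queries mentioned above, here is a rough C sketch over a modeled MCS value. Everything in it is hypothetical: the function names, the 4-bit-per-sample packing with sample 0 in the lowest bits, and the decision to ignore the fast-clear encoding. It is not a proposal for how a future extension would expose the data.

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMPTION: 4 bits per sample, sample 0 in the lowest bits. */
static unsigned plane_of(uint64_t mcs, unsigned sample)
{
    return (unsigned)((mcs >> (4u * sample)) & 0xfu);
}

/* How many samples are stored in the given plane? */
static unsigned mcs_count_plane(uint64_t mcs, unsigned num_samples,
                                unsigned plane)
{
    assert(num_samples <= 16);
    unsigned count = 0;
    for (unsigned s = 0; s < num_samples; s++)
        count += (plane_of(mcs, s) == plane);
    return count;
}

/* Which samples use the given plane? Bit s is set if sample s does. */
static uint32_t mcs_samples_using_plane(uint64_t mcs, unsigned num_samples,
                                        unsigned plane)
{
    assert(num_samples <= 16);
    uint32_t mask = 0;
    for (unsigned s = 0; s < num_samples; s++)
        if (plane_of(mcs, s) == plane)
            mask |= UINT32_C(1) << s;
    return mask;
}

/* What is the highest plane index any sample refers to? */
static unsigned mcs_highest_plane(uint64_t mcs, unsigned num_samples)
{
    assert(num_samples <= 16);
    unsigned highest = 0;
    for (unsigned s = 0; s < num_samples; s++)
        if (plane_of(mcs, s) > highest)
            highest = plane_of(mcs, s);
    return highest;
}
```

A resolve filter could use the count for plane 0 as a weight and read each referenced plane only once, instead of fetching every sample. Hiding the packing behind functions like these would also let the encoding change with the sample count or on future hardware.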