GL_EXT_shader_samples_identical sent into the wild!

Yesterday I pushed an implementation of a new OpenGL extension, GL_EXT_shader_samples_identical, to Mesa. This extension will be in the Mesa 11.1 release in a few short weeks, and it will be enabled on various Intel platforms:

GEN7 (Ivy Bridge, Baytrail, Haswell): Only currently effective in the fragment shader. More details below.
GEN8 (Broadwell, Cherry Trail, Braswell): Only currently effective in the vertex shader and fragment shader. More details below.
GEN9 (Skylake): Only currently effective in the vertex shader and fragment shader. More details below.

The extension hasn't yet been published in the official OpenGL extension registry, but I will take care of that before Mesa 11.1 is released.

Background

Multisample anti-aliasing (MSAA) is a well-known technique for reducing aliasing effects ("jaggies") in rendered images. The core idea is that the expensive part of generating a single pixel color happens once. The cheaper part of determining where that color exists in the pixel happens multiple times. For 2x MSAA this happens twice, for 4x MSAA this happens four times, etc. The computation cost is not increased by much, but the storage and memory bandwidth costs are increased linearly.

Some time ago, a clever person noticed that in areas where the whole pixel is covered by the triangle, all of the samples have exactly the same value. Furthermore, since most triangles are (much) bigger than a pixel, this is really common. From there it is trivial to apply some sort of simple data compression to the sample data, and all modern GPUs do this in some form. In addition to the surface that stores the data, there is a multisample control surface (MCS) that describes the compression.

On Intel GPUs, sample data is stored in n separate planes. For 4x MSAA, there are four planes. The MCS has a small table for each pixel that maps a sample to a plane. If the entry for sample 2 in the MCS is 0, then the data for sample 2 is stored in plane 0. The GPU automatically uses this to reduce bandwidth usage. When writing a pixel on the interior of a polygon (where all the samples have the same value), the MCS gets all zeros written, and the sample value is written only to plane 0.
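
The sample-to-plane lookup can be modeled on the CPU with a few lines of C. This is only a sketch: the 2-bits-per-sample packing, the struct layout, and every name here (msaa_pixel, mcs_plane_for_sample, write_interior_pixel, read_sample) are assumptions made for illustration, not the hardware's documented encoding.

```c
#include <stdint.h>

/* Assumed for illustration: 4x MSAA with a 2-bit plane index per sample. */
enum { NUM_SAMPLES = 4, BITS_PER_ENTRY = 2 };

/* Hypothetical per-pixel storage: one MCS word plus one color per plane. */
struct msaa_pixel {
    uint32_t mcs;
    uint32_t plane[NUM_SAMPLES];
};

/* Look up which plane holds the data for the given sample. */
static unsigned mcs_plane_for_sample(uint32_t mcs, unsigned sample)
{
    const uint32_t mask = (1u << BITS_PER_ENTRY) - 1u;
    return (mcs >> (sample * BITS_PER_ENTRY)) & mask;
}

/* Interior pixel: every sample maps to plane 0, so only plane 0 is written. */
static void write_interior_pixel(struct msaa_pixel *p, uint32_t color)
{
    p->mcs = 0;
    p->plane[0] = color;
}

/* Two-step read: consult the MCS first, then fetch from the selected plane. */
static uint32_t read_sample(const struct msaa_pixel *p, unsigned sample)
{
    return p->plane[mcs_plane_for_sample(p->mcs, sample)];
}
```

In this model, a pixel written with write_interior_pixel returns the same color for every sample index even though only one plane was touched.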

This does add some complexity to the shader compiler. When a shader executes the texelFetch function, several things happen behind the scenes. First, an instruction is issued to read the MCS. Then a second instruction is executed to read the sample data. This second instruction uses the sample index and the result of the MCS read as inputs.

A simple shader like

    #version 150
    uniform sampler2DMS tex;
    uniform int samplePos;

    in vec2 coord;
    out vec4 frag_color;

    void main()
    {
       frag_color = texelFetch(tex, ivec2(coord), samplePos);
    }

generates this assembly

    pln(16)         g8<1>F          g7<0,1,0>F      g2<8,8,1>F      { align1 1H compacted };
    pln(16)         g10<1>F         g7.4<0,1,0>F    g2<8,8,1>F      { align1 1H compacted };
    mov(16)         g12<1>F         g6<0,1,0>F                      { align1 1H compacted };
    mov(16)         g16<1>D         g8<8,8,1>F                      { align1 1H compacted };
    mov(16)         g18<1>D         g10<8,8,1>F                     { align1 1H compacted };
    send(16)        g2<1>UW         g16<8,8,1>F
                                sampler ld_mcs SIMD16 Surface = 1 Sampler = 0 mlen 4 rlen 8 { align1 1H };
    mov(16)         g14<1>F         g2<8,8,1>F                      { align1 1H compacted };
    send(16)        g120<1>UW       g12<8,8,1>F
                                sampler ld2dms SIMD16 Surface = 1 Sampler = 0 mlen 8 rlen 8 { align1 1H };
    sendc(16)       null<1>UW       g120<8,8,1>F
                                render RT write SIMD16 LastRT Surface = 0 mlen 8 rlen 0 { align1 1H EOT };

The ld_mcs instruction is the read from the MCS, and the ld2dms is the read from the multisample surface using the MCS data. If a shader reads multiple samples from the same location, the compiler will likely eliminate all but one of the ld_mcs instructions.

Modern GPUs also have an additional optimization. When an application clears a surface, some values are much more commonly used than others. Permutations of 0s and 1s are, by far, the most common. Bandwidth usage can further be reduced by taking advantage of this. With a single bit for each of red, green, blue, and alpha, only four bits are necessary to describe a clear color that contains only 0s and 1s. A special value could then be stored in the MCS for each sample that uses the fast-clear color. A clear operation that uses a fast-clear compatible color only has to modify the MCS.
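
As a rough illustration of the "one bit per channel" idea, the following C sketch packs a clear color into four bits when every component is exactly 0 or 1. The function name encode_fast_clear_color and the bit order (bit 0 for red through bit 3 for alpha) are assumptions for this example, not the layout any particular GPU uses.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true and stores a 4-bit code in *bits_out when every component of
 * rgba is exactly 0.0 or 1.0. Returns false (leaving *bits_out untouched)
 * for any other color, which would need a regular clear.
 * Assumed bit order: bit 0 = red, bit 1 = green, bit 2 = blue, bit 3 = alpha.
 */
static bool encode_fast_clear_color(const float rgba[4], uint8_t *bits_out)
{
    uint8_t bits = 0;

    for (int i = 0; i < 4; i++) {
        if (rgba[i] == 1.0f)
            bits |= (uint8_t)(1u << i);
        else if (rgba[i] != 0.0f)
            return false;
    }

    *bits_out = bits;
    return true;
}
```

Opaque black, for example, has only the alpha bit set, while a mid-grey clear color is rejected.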

All of this is well documented in the Programmer's Reference Manuals for Intel GPUs.

There's More

Information from the MCS can also help users of the multisample surface reduce memory bandwidth usage. Imagine a simple, straightforward shader that performs an MSAA resolve operation:

    #version 150
    uniform sampler2DMS tex;

    #define NUM_SAMPLES 4 // generate a different shader for each sample count

    in vec2 coord;
    out vec4 frag_color;

    void main()
    {
        vec4 color = texelFetch(tex, ivec2(coord), 0);

        for (int i = 1; i < NUM_SAMPLES; i++)
            color += texelFetch(tex, ivec2(coord), i);

        frag_color = color / float(NUM_SAMPLES);
    }

The problem should be obvious. On most pixels all of the samples will have the same color, but the shader still reads every sample. It's tempting to think the compiler should be able to fix this. In very simple cases like this one, that may be possible, but such an optimization would be both challenging to implement and, likely, very easy to fool.

A better approach is to just make the data available to the shader, and that is where this extension comes in. A new function textureSamplesIdenticalEXT is added that allows the shader to detect the common case where all the samples have the same value. The new, optimized shader would be:

    #version 150
    #extension GL_EXT_shader_samples_identical: enable
    uniform sampler2DMS tex;

    #define NUM_SAMPLES 4 // generate a different shader for each sample count

    #if !defined GL_EXT_shader_samples_identical
    #define textureSamplesIdenticalEXT(t, c)  false
    #endif

    in vec2 coord;
    out vec4 frag_color;

    void main()
    {
        vec4 color = texelFetch(tex, ivec2(coord), 0);

        if (!textureSamplesIdenticalEXT(tex, ivec2(coord))) {
            for (int i = 1; i < NUM_SAMPLES; i++)
                color += texelFetch(tex, ivec2(coord), i);

            color /= float(NUM_SAMPLES);
        }

        frag_color = color;
    }

The intention is that this function be implemented by simply examining the MCS data. At least on Intel GPUs, if the MCS for a pixel is all 0s, then all the samples are the same. Since textureSamplesIdenticalEXT can reuse the MCS data read by the first texelFetch call, there are no extra reads from memory. There is just a single compare and conditional branch. These added instructions can be scheduled while waiting for the ld2dms instruction to read from memory (slow), so they are practically free.
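
The saving can be made concrete with a small CPU model of the resolve that counts sample fetches. It is a sketch under stated assumptions: the 2-bit MCS packing, the single-float "color", and all names (msaa_pixel, fetch_sample, samples_identical, resolve_pixel, sample_fetches) are hypothetical and exist only for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed for illustration: 4x MSAA with a 2-bit plane index per sample. */
enum { NUM_SAMPLES = 4, BITS_PER_ENTRY = 2 };

/* Hypothetical per-pixel storage; one float stands in for a full color. */
struct msaa_pixel {
    uint32_t mcs;
    float plane[NUM_SAMPLES];
};

/* Counts how many sample reads the resolve issues. */
static unsigned sample_fetches;

static float fetch_sample(const struct msaa_pixel *p, unsigned sample)
{
    const uint32_t mask = (1u << BITS_PER_ENTRY) - 1u;

    sample_fetches++;
    return p->plane[(p->mcs >> (sample * BITS_PER_ENTRY)) & mask];
}

/* Stand-in for textureSamplesIdenticalEXT: an all-zero MCS means plane 0 only. */
static bool samples_identical(const struct msaa_pixel *p)
{
    return p->mcs == 0;
}

/* Same structure as the optimized shader: skip the loop when samples match. */
static float resolve_pixel(const struct msaa_pixel *p)
{
    float color = fetch_sample(p, 0);

    if (!samples_identical(p)) {
        for (unsigned i = 1; i < NUM_SAMPLES; i++)
            color += fetch_sample(p, i);

        color /= (float)NUM_SAMPLES;
    }

    return color;
}
```

In this model an interior pixel costs one fetch instead of four, while an edge pixel still reads every sample.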

It is also tempting to use textureSamplesIdenticalEXT in conjunction with anyInvocationsARB (from GL_ARB_shader_group_vote). Such a shader might look like:

    #version 430
    #extension GL_EXT_shader_samples_identical: require
    #extension GL_ARB_shader_group_vote: require
    uniform sampler2DMS tex;

    #define NUM_SAMPLES 4 // generate a different shader for each sample count

    in vec2 coord;
    out vec4 frag_color;

    void main()
    {
        vec4 color = texelFetch(tex, ivec2(coord), 0);

        if (anyInvocationsARB(!textureSamplesIdenticalEXT(tex, ivec2(coord)))) {
            for (int i = 1; i < NUM_SAMPLES; i++)
                color += texelFetch(tex, ivec2(coord), i);

            color /= float(NUM_SAMPLES);
        }

        frag_color = color;
    }

Whether using anyInvocationsARB improves performance likely depends on both the shader and the underlying GPU hardware. Currently Mesa does not support GL_ARB_shader_group_vote, so I don't have any data one way or the other.
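
One side of that trade-off can be sketched in C by counting fetches for a group of invocations with and without the vote. This model is deliberately incomplete: it counts only sample fetches and ignores the cost of divergent control flow, which is exactly the part that depends on the hardware. The group size of 16 and all names (GROUP_SIZE, any_invocations, fetches_without_vote, fetches_with_vote) are assumptions for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { GROUP_SIZE = 16 };  /* assumed number of invocations that vote together */

/* CPU stand-in for anyInvocationsARB: true if any invocation's condition is true. */
static bool any_invocations(const bool cond[GROUP_SIZE])
{
    for (size_t i = 0; i < GROUP_SIZE; i++) {
        if (cond[i])
            return true;
    }
    return false;
}

/* Fetches issued when each invocation branches on its own MCS value. */
static unsigned fetches_without_vote(const uint32_t mcs[GROUP_SIZE],
                                     unsigned num_samples)
{
    unsigned fetches = 0;

    for (size_t i = 0; i < GROUP_SIZE; i++)
        fetches += (mcs[i] != 0) ? num_samples : 1u;

    return fetches;
}

/* Fetches issued when the whole group takes the slow path if anyone needs it. */
static unsigned fetches_with_vote(const uint32_t mcs[GROUP_SIZE],
                                  unsigned num_samples)
{
    bool needs_all_samples[GROUP_SIZE];

    for (size_t i = 0; i < GROUP_SIZE; i++)
        needs_all_samples[i] = (mcs[i] != 0);

    return GROUP_SIZE * (any_invocations(needs_all_samples) ? num_samples : 1u);
}
```

With a single edge pixel in a group of 16 at 4x MSAA, the voting version issues 64 fetches against 19 without the vote, in exchange for a branch that is uniform across the group.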

Caveats

The implementation of this extension that will ship with Mesa 11.1 has three main caveats. Each of these will likely be resolved to some extent in future releases.

The extension is only effective on scalar shader units. This means on GEN7 it is effective only in fragment shaders. On GEN8 and GEN9 it is effective only in vertex shaders and fragment shaders. It is supported in all shader stages, but in non-scalar stages textureSamplesIdenticalEXT always returns false. The implementation for the non-scalar stages is slightly different, and, on GEN9, the exact set of instructions depends on the number of samples. I didn't think it was likely that people would want to use this feature in a vertex shader or geometry shader, so I just didn't finish the implementation. This will almost certainly be resolved in Mesa 11.2.

The current implementation also returns a false negative for texels fully set to the fast-clear color. There are two problems with the fast-clear color. It uses a different value than the "all plane 0" case, and the size of the value depends on the number of samples. For 2x MSAA, the MCS read returns 0x000000ff, but for 8x MSAA it returns 0xffffffff.

The first problem means the compiler would need to generate additional instructions to check for "all plane 0" or "all fast-clear color." This could hurt the performance of applications that either don't use a fast-clear color or, more likely, that later draw non-clear data to the entire surface. The second problem means the compiler would need to do state-based recompiles when the number of samples changes.
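
A sketch of what the combined check might look like, written in C rather than in generated shader code. Only the two fast-clear values quoted above (2x and 8x) are included; values for other sample counts are not given here, so the sketch treats them as unknown. The function names fast_clear_mcs_value and samples_identical_with_clear are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Fast-clear MCS value for a sample count. Only the two values quoted in
 * the text are listed; other sample counts are treated as unknown.
 */
static bool fast_clear_mcs_value(unsigned num_samples, uint32_t *value)
{
    switch (num_samples) {
    case 2:
        *value = 0x000000ffu;
        return true;
    case 8:
        *value = 0xffffffffu;
        return true;
    default:
        return false;
    }
}

/* "All plane 0" is one compare; the fast-clear case adds a second one. */
static bool samples_identical_with_clear(uint32_t mcs, unsigned num_samples)
{
    uint32_t clear_value;

    if (mcs == 0)
        return true;

    return fast_clear_mcs_value(num_samples, &clear_value) &&
           mcs == clear_value;
}
```

Here the sample count is a run-time parameter. In a compiled shader the constant would be baked into the instruction stream, which is why a change in sample count would force a recompile.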

In the end, we decided that "all plane 0" was by far the most common case, so we have ignored the "all fast-clear color" case for the time being. We are still collecting data from applications, and we're working on several uses of this functionality inside our driver. In future versions we may implement a heuristic to determine whether or not to check for the fast-clear color case.

As mentioned above, Mesa does not currently support GL_ARB_shader_group_vote. Applications that want to use textureSamplesIdenticalEXT on Mesa will need paths that do not use anyInvocationsARB for at least the time being.

Future

As stated in issue #3, the extension still needs to gain SPIR-V support. This extension would be just as useful in Vulkan and OpenCL as it is in OpenGL.

At some point there is likely to be a follow-on extension that provides more MCS data to the shader in a more raw form. As stated in issue #2 and previously in this post, there are a few problems with providing raw MCS data. The biggest problem is how the data is returned. Each sample count needs a different amount of data. Current 8x MSAA MCS surfaces have 32 bits (returned) per pixel. Current 16x MSAA MCS surfaces have 64 bits per pixel. Future 32x MSAA, should that ever exist, would need 192 bits. Additionally, there would need to be a set of new texelFetch functions that take both a sample index and the MCS data. This, again, has problems with variable data size.

Applications would also want to query things about the MCS values. How many times is plane 0 used? Which samples use plane 2? What is the highest plane used? There could be other useful queries. I can imagine that a high quality, high performance multisample resolve filter could want all of this information. Since the data changes based on the sample count and could change on future hardware, the future extension really should not directly expose the encoding of the MCS data. How should it provide the data? I'm expecting to write some demo applications and experiment with a bunch of different things. Obviously, this is an open area of research.
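
To make the three example queries concrete, here is a C sketch that answers them from an already-decoded array of per-sample plane indices, so that nothing depends on the MCS bit encoding. This is not a proposal for the extension's API; the names count_plane_uses, samples_using_plane, and highest_plane_used are hypothetical, and the sketch assumes at most 32 samples.

```c
#include <stddef.h>
#include <stdint.h>

/* How many samples are stored in the given plane? */
static unsigned count_plane_uses(const uint8_t *planes, size_t num_samples,
                                 uint8_t plane)
{
    unsigned count = 0;

    for (size_t i = 0; i < num_samples; i++) {
        if (planes[i] == plane)
            count++;
    }
    return count;
}

/* Which samples use the given plane? Bit i is set when sample i does. */
static uint32_t samples_using_plane(const uint8_t *planes, size_t num_samples,
                                    uint8_t plane)
{
    uint32_t mask = 0;

    for (size_t i = 0; i < num_samples && i < 32; i++) {
        if (planes[i] == plane)
            mask |= (uint32_t)1 << i;
    }
    return mask;
}

/* What is the highest plane index referenced by any sample? */
static uint8_t highest_plane_used(const uint8_t *planes, size_t num_samples)
{
    uint8_t highest = 0;

    for (size_t i = 0; i < num_samples; i++) {
        if (planes[i] > highest)
            highest = planes[i];
    }
    return highest;
}
```

For a 4x pixel whose samples map to planes 0, 0, 2, and 1, plane 0 is used twice, only sample 2 uses plane 2, and the highest plane used is 2.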