Home

Blog

Bake AO

About

Contact

Home

Blog

Bake AO

About

Contact

Blog

GPU Sorting Algorithms - Benchmarks

Jul 31, 2025

15 min

In this article 1. I will explain where I've used sorting algorithms in practice. 2. I will explain how to improve bloom quality by implementing a median filter, which uses sorting. 3. I will benchmark many sorting algorithms and compare their performance. I will test all of them on the bloom prefiltering example. 4. Summary. I assume that you are familiar with writing custom shader code and with basic sorting algorithms (bubble sort, selection sort, insertion sort).

Where GPU sorting is used

I personally used sorting in my shaders for two effects: denoising and terrain rendering.

Denoising:

The simplest denoising algorithm is the median filter, which I implemented by gathering neighboring pixel values, sorting them by brightness, and outputting the median value.

I applied this denoising method to baked textures, bloom prefiltering, and other visual effects.

:center-px:

Denoising baked textures significantly reduces many baking artifacts.

:center-px:

Denoising also helps reduce bloom flicker, which I explain further in the next section.

Terrain renderer

I also used sorting algorithms a couple of times to filter the most important textures to sample, typically in terrain rendering.

Imagine you have a terrain with many texture layers. Each layer has its own set of textures (albedo, normal, specular, etc.), and all these textures are stored in an array. Splat textures, known as splatmaps, control the alpha value (opacity or blend amount) of those layers.

To render one layer, I needed to sample all its textures and apply them using the alpha value stored in the splatmap.

:center-px:

Visualisation of how splatmap rendering works. Each channel in the splatmap texture controls the alpha value of one layer. Source images were yoinked from a brilliant Catlikecoding tutorial: https://catlikecoding.com/unity/tutorials/rendering/part-3/

If the terrain contains 16 layers, it wastes resources to sample and calculate all of them for each pixel when most layers are not visible.

This is where sorting kicks in. I used it to identify the four most important layers. I gathered all the splatmap values, sorted them, and only sampled textures for 4 layers.

Instead of calculating 16 layers, I could simply sample the 4 most important ones.

:image-description:

I used this optimization in the Knights of Honor 2: Sovereign.

Bloom prefiltering

I will now demonstrate how I resolved the bloom flickering issue using a median filter and then benchmark some popular sorting algorithms.

Bloom 101

Before I dive into sorting, I need to explain how the bloom works.

Bloom is the postprocess effect that floods the bright pixels into neighboring ones, effectively increasing perceived brightness of the image.

:center-px:

:image-description:

Original image on the left, bloom effect applied on the right.

The first stage of the bloom effect is to prefilter the source texture. This stage identifies pixels with brightness above a certain threshold.

:center-px:

:image-description:

Original texture on the left, prefiltered bright pixels on the right.

The next stage is to blur the prefiltered texture. This is where the "flooding" happens. Blurring the texture usually uses a pyramid buffer. It means that it downscales the source texture in a couple of steps and then upscales it back, each time blurring the contents a little.

:image-description:

This is how the pyramid buffer blur looks.

At the end, the output texture is applied to the screen using additive blending.

:center-px:

The original texture is combined with the blurred output to create a final image.

I prepared a scene with sharp specular reflections and created a script that animates the camera slightly. Unity's built-in bloom flickers when small, bright pixels appear and disappear quickly. I will show how I fixed this using a median filter.

Editing Unity's bloom shader

The project I use contains URP, and the bloom shader is inside the URP package. To modify it, I will move the URP folder from Library/Package Cache into Packages.

Now I can find the Bloom shader in the Unity project browser, and I can edit it.

:center-px:

Finding the prefiltering function

The first step is to find the code responsible for prefiltering.

I found this code fragment. It seems to be responsible for prefiltering the bloom texture.

half4 FragPrefilter(Varyings input) : SV_Target  
{  
  ...
  #if _BLOOM_HQ  
  half3 A = SamplePrefilter(uv, float2(-1.0, -1.0));  
  half3 B = SamplePrefilter(uv, float2( 0.0, -1.0));  
  half3 C = SamplePrefilter(uv, float2( 1.0, -1.0));  
  half3 D = SamplePrefilter(uv, float2(-0.5, -0.5));  
  half3 E = SamplePrefilter(uv, float2( 0.5, -0.5));  
  half3 F = SamplePrefilter(uv, float2(-1.0,  0.0));  
  half3 G = SamplePrefilter(uv, float2( 0.0,  0.0));  
  half3 H = SamplePrefilter(uv, float2( 1.0,  0.0));  
  half3 I = SamplePrefilter(uv, float2(-0.5,  0.5));  
  half3 J = SamplePrefilter(uv, float2( 0.5,  0.5));  
  half3 K = SamplePrefilter(uv, float2(-1.0,  1.0));  
  half3 L = SamplePrefilter(uv, float2( 0.0,  1.0));  
  half3 M = SamplePrefilter(uv, float2( 1.0,  1.0));
  half2 div = (1.0 / 4.0) * half2(0.5, 0.125);
  half3 color = (D + E + I + J) * div.x;  
  color += (A + B + G + F) * div.y;  
  color += (B + C + H + G) * div.y;  
  color += (F + G + L + K) * div.y;  
  color += (G + H + M + L) * div.y;  
  #else  
  // I will edit the code here  
  half3 color = SamplePrefilter(uv, float2(0,0));  
  #endif  
  ...

The SamplePrefilter function takes the UV (texture coordinate) and a pixel offset, making it ideal for sampling surrounding pixels. Additionally, a different code is executed when _BLOOM_HQ is defined. This is a High Quality Filtering setting for Bloom that changes the filtering behavior.

:center-px:

I will check if that's the right spot by offsetting the UV to see if the bloom effect is misplaced on the screen as a result.

:image-description:

Now that I'm sure I'm in the correct place, I can start to implement the median filter.

Median filter

Now I will implement a simple median filter by sampling 9 neighboring pixels and sorting them using the bubble sort algorithm.

The median filter returns the pixel with median brightness among its neighbors.

First, I will define a separate method that will sample the source texture using a median filter.

Then I will implement its contents. For now, I will just return the texture color.

half3 SampleMedianPrefilter(float2 uv)  
{  
    // I will implement a median filter here later, but for now, I will just return a sampled color:  
    return SamplePrefilter(uv, float2(0.0, 0.0));  
}

#else  
    // This is how I will use it.  
    half3 color = SampleMedianPrefilter(uv);  
#endif

I prepared a sampling kernel, which specifies the relative positions of all neighboring pixels to sample. I then used a for-loop to sample neighboring pixels and save their brightness (luminance) into the alpha channel of the color.

At the end, I have 9 sampled values. Each value contains color in RGB channels and brightness in the A channel.

// All the samples the median kernel should sample.  
static const int medianKernelLength = 9;  
static half2 medianKernel[medianKernelLength] =  
{  
	half2(-1.0, -1.0),  
	half2(-1.0, 0.0),  
	half2(-1.0, 1.0),  
	half2(0.0, -1.0),  
	half2(0.0, 0.0),  
	half2(0.0, 1.0),  
	half2(1.0, -1.0),  
	half2(1.0, 0.0),  
	half2(1.0, 1.0)  
};

// Returns the luminance of the color by getting the maximum value of this color.  
half GetLuminance(half3 color)  
{  
	return max(color.r, max(color.g, color.b));  
}

half3 SampleMedianPrefilter(float2 uv)  
{  
	// Sampling all the neighbors and saving them in an array.  
	half4 values[medianKernelLength];
	
	[unroll]  
	for (int i = 0; i < medianKernelLength; i++)  
	{  
		values[i].rgb = SamplePrefilter(uv, medianKernel[i]);  
		values[i].a = GetLuminance(values[i].rgb); // Saving luminance into alpha channel. I will use it for sorting.  
	}  
}

Now it's time to implement sorting. For now, I will use a bubble sort

half3 SampleMedianPrefilter(float2 uv)  
{  
	// Sampling all the neighbors and saving them in an array.  
	half4 values[medianKernelLength];
	
	[unroll]  
	for (int i = 0; i < medianKernelLength; i++)  
	{  
		values[i].rgb = SamplePrefilter(uv, medianKernel[i]);  
		values[i].a = GetLuminance(values[i].rgb); // Saving luminance into alpha channel. I will use it for sorting.  
	}
	
	// Bubble sort by color brightness  
	for (int k = 0; k < medianKernelLength - 1; k++)  
	{  
		for (int j = 0; j < medianKernelLength - k - 1; j++)  
		{  
			if (values[j].a > values[j + 1].a)  
			{  
				half4 tmp = values[j];  
				values[j] = values[j + 1];  
				values[j + 1] = tmp;  
			}  
		}  
	}
	
	// Return median after sorting  
	return values[medianKernelLength / 2].rgb;  
}

Let's see the results!

:center-px:

:image-description:

Switching between the custom median filter and the original shader.

And let's profile the results! I compiled the project and profiled it in FullHD on RTX 3060.

:center-px:

This sorting algorithm took 2.13ms. For 60FPS standard, that's 12.7% of the frame budget. Not good. I need to find something better.

___

Sorting algorithms - performance benchmarks

As you saw, sorting just 9 elements on the GPU is too costly to keep it as is, so I need to figure out how to improve its performance.

To begin with performance comparison, I will create a baseline profile.

:center-px:

:image-description:

This is the original bloom prefilter performance. 0.03ms per frame. I will compare various sorting algorithms to this one.

In this section, I will conduct a series of benchmarks on sorting algorithms.

What sorting algorithms can I implement?

A GPU processor doesn't have a program execution stack. And all the functions used in the shader are inlined - so it is not possible to use any sorting algorithm that requires recursion, like quicksort or mergesort.

So I'm left with simple sorting algorithms:

bubble sort
insertion sort
selection sort
network sort

I will test them and measure the performance of each one.

Each sorting algorithm will be benchmarked in two versions. With unrolling and without unrolling.

Unroll is a compiler hint that tells it to expand the for loop during compile-time. This way, only the outcomes for each loop iteration are compiled, skipping the usual loop control instructions.

Unrolled code looks like this:

[unroll] // Hint to unroll the loop during the compilation  
for (int k = 0; k < medianKernelLength - 1; k++)  
{  
   [unroll] // Hint to unroll the loop during the compilation  
   for (int j = 0; j < medianKernelLength - k - 1; j++)  
   {  
       [flatten] // Hint to not branch the 'if'  
       if (values[j].a > values[j + 1].a)  
       {  
           half4 tmp = values[j];  
           values[j] = values[j + 1];  
           values[j + 1] = tmp;  
       }  
   }  
}

And this is non-unrolled code:

for (int k = 0; k < medianKernelLength - 1; k++)  
{  
   for (int j = 0; j < medianKernelLength - k - 1; j++)  
   {  
       if (values[j].a > values[j + 1].a)  
       {  
           half4 tmp = values[j];  
           values[j] = values[j + 1];  
           values[j + 1] = tmp;  
       }  
   }  
}

Bubble sort

Bubble sort is known for being one of the slowest sorting algorithms ever created. But you will quickly see that it's quite strong on the GPU.

It works by traversing the whole array and comparing and swapping all neighboring elements each time.

for (int k = 0; k < medianKernelLength - 1; k++)  
{  
   for (int j = 0; j < medianKernelLength - k - 1; j++)  
   {  
       if (values[j].a > values[j + 1].a)  
       {  
           half4 tmp = values[j];  
           values[j] = values[j + 1];  
           values[j + 1] = tmp;  
       }  
   }  
}

Non-Unrolled: 2.13ms

Unrolled: 0.06ms

Yes, it is not a mistake. I couldn't believe my eyes, I checked 3 times.

Unrolled bubble sort is 35x faster!

Insertion sort

Let's move forward. Insertion sort is the next one. Known as the fastest sorting algorithm on the CPU for small sets. It works by picking the unsorted element and placing it within a sorted part.

// Insertion sort by alpha  
for (int k = 1; k < medianKernelLength; k++)  
{  
    half4 key = values[k];  
    int j = k - 1;
  
    // Manual shift loop  
    while (j >= 0 && values[j].a > key.a)  
    {  
        values[j + 1] = values[j];  
        j--;  
    }  
  
    values[j + 1] = key;  
}

Non-Unrolled: 2.38ms

Unrolled: 1.67ms

It appears that insertion sort has limited unrolling capabilities. The `while` loop inside is not easy for the compiler to predict during compile-time.

Selection sort

Selection sort works by finding the minimum value and putting it at the end of the sorted part.

// Selection sort  
for (int k = 0; k < medianKernelLength - 1; k++)  
{  
	int minIndex = k;  
	for (int j = k + 1; j < medianKernelLength; j++)  
	{  
		if (values[j].a < values[minIndex].a)  
		{  
			minIndex = j;  
		}  
	}
	
	if (minIndex != k)  
	{  
		half4 temp = values[k];  
		values[k] = values[minIndex];  
		values[minIndex] = temp;  
	}  
}

Non-Unrolled: 1.44ms

Unrolled: 1.34ms

Huh... It appears to be the fastest algorithm after bubble sort, but the compiler was unable to unroll it properly.

Min-max selection sort

One modification for the selection sort is to find the minimum and maximum value in the unsorted part and keep both ends of the array sorted. It will reduce the number of iterations by a factor of two compared to the original selection sort. Let's try.

int lastIndex = medianKernelLength - 1;  
for (int i = 0; i < lastIndex; i++)  
{  
	int minIndex = i;  
	int maxIndex = i;
	
	for (int j = i + 1; j <= lastIndex; j++)  
	{  
		if (values[j].w < values[minIndex].w)  
			minIndex = j;
		
		if (values[j].w > values[maxIndex].w)  
			maxIndex = j;  
	}
	
	// Swap  
	half4 temp = values[i];  
	values[i] = values[minIndex];  
	values[minIndex] = values[i];
	
	if (maxIndex == i)  
		maxIndex = minIndex;
	
	temp = values[lastIndex];  
	values[lastIndex] = values[maxIndex];  
	values[maxIndex] = temp;
	
	lastIndex--;  
}

Non-Unrolled: 1.22ms

Unrolled: 0.54ms

And now, unrolling worked again, making the algorithm 2.5 times faster.

Network sort

Network sort works by iterating the precomputed index pairs. For each layer, compare the corresponding elements and swap them if necessary.

Index pairs are precomputed for a given number of elements. This algorithm has no branching, and its execution time remains constant.

:center-px:

:image-description:

Sorting network for 9 elements. Source: https://bertdobbelaere.github.io/sorting_networks.html#N9L25D7

Let's implement network sort. Network sort is designed for a fixed number of elements in the collection, so I need to generate the index pairs for 9 elements somehow. There are various algorithms that can generate those pairs. The most common is Bitonic Sort, which can generate index pairs for power-of-two elements. However, 9 elements is a common case, and I can easily find a proven optimal sorting network on the Internet.

I found this one:

https://bertdobbelaere.github.io/sorting_networks.html#N9L25D7

:center-px:

I will convert all the pairs and execute the compare-and-swap function one by one. So the pair (3, 6) will be CompareAndSwap(values[3], values[6])

void CompareAndSwap(inout half4 a, inout half4 b)  
{  
    [flatten]  
    if (a.w > b.w)  
    {  
       half4 temp = a;  
       a = b;  
       b = temp;  
    }  
}

// Network sort  
CompareAndSwap(values[0], values[3]);  
CompareAndSwap(values[1], values[7]);  
CompareAndSwap(values[2], values[5]);  
CompareAndSwap(values[4], values[8]);
CompareAndSwap(values[0], values[7]);  
CompareAndSwap(values[2], values[4]);  
CompareAndSwap(values[3], values[8]);  
CompareAndSwap(values[5], values[6]);
CompareAndSwap(values[0], values[2]);  
CompareAndSwap(values[1], values[3]);  
CompareAndSwap(values[4], values[5]);  
CompareAndSwap(values[7], values[8]);
CompareAndSwap(values[1], values[4]);  
CompareAndSwap(values[3], values[6]);  
CompareAndSwap(values[5], values[7]);
CompareAndSwap(values[0], values[1]);  
CompareAndSwap(values[2], values[4]);  
CompareAndSwap(values[3], values[5]);  
CompareAndSwap(values[6], values[8]);
CompareAndSwap(values[2], values[3]);  
CompareAndSwap(values[4], values[5]);  
CompareAndSwap(values[6], values[7]);
CompareAndSwap(values[1], values[2]);  
CompareAndSwap(values[3], values[4]);  
CompareAndSwap(values[5], values[6]);

No-Flatten: 0.04ms

Flatten: 0.04ms

Looks like I have my winner.

Benchmark results

And here are the results.

:center-px:

:image-description:

Results, time in milliseconds.

The most interesting part is that bubble sort is one of the fastest sorting algorithms here. When unrolled, it outperforms insertion and selection sort. Why is that?

After I manually unrolled the bubble sort, I quickly noticed that it resembles a non-optimal network sort. It's a bunch of fixed-element comparisons without any branching.

Execution of other sorting algorithms, like insertion sort or selection sort, depends on the source data, which makes them slower.

:center-px:

Also, when comparing bubble sort and insertion sort, it looks like the compiler couldn’t figure out how to avoid using slow memory in the insertion sort.

In the bubble sort and network sort, compiler created a program that works on immediate registers of the GPU, while other sorting algoritms use local memory with a slow access patterns.

Simple and easy-to-predict program flow allowed the compiler skip memory accesses completely, implementing all of that using the fast registers.

:center-px:

:center-px:

:image-description:

In insertion sort, the compiler uses many instructions for the slow memory unit (LSU - Long Scoreboard Unit, Load/Store to Local Memory), while in Bubble sort, this memory is never used.

Summary

Sorting algorithms are frequently used in GPU shaders for effects like denoising and terrain rendering optimization.
Median filtering, implemented using sorting, is effective at reducing artifacts and flickering in bloom effects and baked textures.
Manual loop unrolling dramatically improves the performance of some algorithms, particularly bubble sort.
Network sort is found to be the fastest for this use case, but unrolled bubble sort is surprisingly competitive due to its simplicity and lack of branching.
The GPU’s architecture (lack of recursion, benefits of unrolling, memory access patterns) heavily influences which sorting algorithm is most efficient.

‹ VRAM bandwidth and its big role in optimization

How to profile the rendering - GPU profiling basics ›