Rendering
GPU Sorting Algorithms - Benchmarks
Jul 31, 2025
15 min
In this article 1. I will explain where I've used sorting algorithms in practice. 2. I will explain how to improve bloom quality by implementing a median filter, which uses sorting. 3. I will benchmark many sorting algorithms and compare their performance. I will test all of them on the bloom prefiltering example. 4. Summary. I assume that you are familiar with writing custom shader code and with basic sorting algorithms (bubble sort, selection sort, insertion sort).
Where GPU sorting is used
I personally used sorting in my shaders for two effects: denoising and terrain rendering.
Denoising:
The simplest denoising algorithm is the median filter, which I implemented by gathering neighboring pixel values, sorting them by brightness, and outputting the median value.
I applied this denoising method to baked textures, bloom prefiltering, and other visual effects.
:center-px:

Denoising baked textures significantly reduces many baking artifacts.
:center-px:

Denoising also helps reduce bloom flicker, which I explain further in the next section.
Terrain renderer
I also used sorting algorithms a couple of times to filter the most important textures to sample, typically in terrain rendering.
Imagine you have a terrain with many texture layers. Each layer has its own set of textures (albedo, normal, specular, etc.), and all these textures are stored in an array. Splat textures, known as splatmaps, control the alpha value (opacity or blend amount) of those layers.
To render one layer, I needed to sample all its textures and apply them using the alpha value stored in the splatmap.
:center-px:

Visualisation of how splatmap rendering works. Each channel in the splatmap texture controls the alpha value of one layer. Source images were yoinked from a brilliant Catlikecoding tutorial: https://catlikecoding.com/unity/tutorials/rendering/part-3/
If the terrain contains 16 layers, it wastes resources to sample and calculate all of them for each pixel when most layers are not visible.
This is where sorting kicks in. I used it to identify the four most important layers. I gathered all the splatmap values, sorted them, and only sampled textures for 4 layers.
Instead of calculating 16 layers, I could simply sample the 4 most important ones.

:image-description:
I used this optimization in the Knights of Honor 2: Sovereign.
Bloom prefiltering
I will now demonstrate how I resolved the bloom flickering issue using a median filter and then benchmark some popular sorting algorithms.
Bloom 101
Before I dive into sorting, I need to explain how the bloom works.
Bloom is the postprocess effect that floods the bright pixels into neighboring ones, effectively increasing perceived brightness of the image.
:center-px:

:image-description:
Original image on the left, bloom effect applied on the right.
The first stage of the bloom effect is to prefilter the source texture. This stage identifies pixels with brightness above a certain threshold.
:center-px:

:image-description:
Original texture on the left, prefiltered bright pixels on the right.
The next stage is to blur the prefiltered texture. This is where the "flooding" happens. Blurring the texture usually uses a pyramid buffer. It means that it downscales the source texture in a couple of steps and then upscales it back, each time blurring the contents a little.
:image-description:
This is how the pyramid buffer blur looks.
At the end, the output texture is applied to the screen using additive blending.
:center-px:

The original texture is combined with the blurred output to create a final image.
I prepared a scene with sharp specular reflections and created a script that animates the camera slightly. Unity's built-in bloom flickers when small, bright pixels appear and disappear quickly. I will show how I fixed this using a median filter.
Editing Unity's bloom shader
The project I use contains URP, and the bloom shader is inside the URP package. To modify it, I will move the URP folder from Library/Package Cache
into Packages
.
Now I can find the Bloom shader in the Unity project browser, and I can edit it.
:center-px:

Finding the prefiltering function
The first step is to find the code responsible for prefiltering.
I found this code fragment. It seems to be responsible for prefiltering the bloom texture.
The SamplePrefilter
function takes the UV (texture coordinate) and a pixel offset, making it ideal for sampling surrounding pixels. Additionally, a different code is executed when _BLOOM_HQ
is defined. This is a High Quality Filtering setting for Bloom that changes the filtering behavior.
:center-px:

I will check if that's the right spot by offsetting the UV to see if the bloom effect is misplaced on the screen as a result.

:image-description:
Now that I'm sure I'm in the correct place, I can start to implement the median filter.
Median filter
Now I will implement a simple median filter by sampling 9 neighboring pixels and sorting them using the bubble sort algorithm.
The median filter returns the pixel with median brightness among its neighbors.
First, I will define a separate method that will sample the source texture using a median filter.
Then I will implement its contents. For now, I will just return the texture color.
I prepared a sampling kernel, which specifies the relative positions of all neighboring pixels to sample. I then used a for-loop to sample neighboring pixels and save their brightness (luminance) into the alpha channel of the color.
At the end, I have 9 sampled values. Each value contains color in RGB channels and brightness in the A channel.
Now it's time to implement sorting. For now, I will use a bubble sort
Let's see the results!
:center-px:

:image-description:
Switching between the custom median filter and the original shader.
And let's profile the results! I compiled the project and profiled it in FullHD on RTX 3060.
:center-px:

This sorting algorithm took 2.13ms. For 60FPS standard, that's 12.7% of the frame budget. Not good. I need to find something better.
___
Sorting algorithms - performance benchmarks
As you saw, sorting just 9 elements on the GPU is too costly to keep it as is, so I need to figure out how to improve its performance.
To begin with performance comparison, I will create a baseline profile.
:center-px:

:image-description:
This is the original bloom prefilter performance. 0.03ms per frame. I will compare various sorting algorithms to this one.
In this section, I will conduct a series of benchmarks on sorting algorithms.
What sorting algorithms can I implement?
A GPU processor doesn't have a program execution stack. And all the functions used in the shader are inlined - so it is not possible to use any sorting algorithm that requires recursion, like quicksort or mergesort.
So I'm left with simple sorting algorithms:
bubble sort
insertion sort
selection sort
network sort
I will test them and measure the performance of each one.
Each sorting algorithm will be benchmarked in two versions. With unrolling and without unrolling.
Unroll is a compiler hint that tells it to expand the for loop during compile-time. This way, only the outcomes for each loop iteration are compiled, skipping the usual loop control instructions.
Unrolled code looks like this:
And this is non-unrolled code:
Bubble sort
Bubble sort is known for being one of the slowest sorting algorithms ever created. But you will quickly see that it's quite strong on the GPU.
It works by traversing the whole array and comparing and swapping all neighboring elements each time.
Non-Unrolled: 2.13ms
Unrolled: 0.06ms
Yes, it is not a mistake. I couldn't believe my eyes, I checked 3 times.
Unrolled bubble sort is 35x faster!
Insertion sort
Let's move forward. Insertion sort is the next one. Known as the fastest sorting algorithm on the CPU for small sets. It works by picking the unsorted element and placing it within a sorted part.
Non-Unrolled: 2.38ms
Unrolled: 1.67ms
It appears that insertion sort has limited unrolling capabilities. The `while` loop inside is not easy for the compiler to predict during compile-time.
Selection sort
Selection sort works by finding the minimum value and putting it at the end of the sorted part.
Non-Unrolled: 1.44ms
Unrolled: 1.34ms
Huh... It appears to be the fastest algorithm after bubble sort, but the compiler was unable to unroll it properly.
Min-max selection sort
One modification for the selection sort is to find the minimum and maximum value in the unsorted part and keep both ends of the array sorted. It will reduce the number of iterations by a factor of two compared to the original selection sort. Let's try.
Non-Unrolled: 1.22ms
Unrolled: 0.54ms
And now, unrolling worked again, making the algorithm 2.5 times faster.
Network sort
Network sort works by iterating the precomputed index pairs. For each layer, compare the corresponding elements and swap them if necessary.
Index pairs are precomputed for a given number of elements. This algorithm has no branching, and its execution time remains constant.
:center-px:

:image-description:
Sorting network for 9 elements. Source: https://bertdobbelaere.github.io/sorting_networks.html#N9L25D7
Let's implement network sort. Network sort is designed for a fixed number of elements in the collection, so I need to generate the index pairs for 9 elements somehow. There are various algorithms that can generate those pairs. The most common is Bitonic Sort, which can generate index pairs for power-of-two elements. However, 9 elements is a common case, and I can easily find a proven optimal sorting network on the Internet.
I found this one:
https://bertdobbelaere.github.io/sorting_networks.html#N9L25D7
:center-px:

I will convert all the pairs and execute the compare-and-swap function one by one. So the pair (3, 6) will be CompareAndSwap(values[3], values[6])
No-Flatten: 0.04ms
Flatten: 0.04ms
Looks like I have my winner.
Benchmark results
And here are the results.
:center-px:

:image-description:
Results, time in milliseconds.
The most interesting part is that bubble sort is one of the fastest sorting algorithms here. When unrolled, it outperforms insertion and selection sort. Why is that?
After I manually unrolled the bubble sort, I quickly noticed that it resembles a non-optimal network sort. It's a bunch of fixed-element comparisons without any branching.
Execution of other sorting algorithms, like insertion sort or selection sort, depends on the source data, which makes them slower.
:center-px:

Also, when comparing bubble sort and insertion sort, it looks like the compiler couldn’t figure out how to avoid using slow memory in the insertion sort.
In the bubble sort and network sort, compiler created a program that works on immediate registers of the GPU, while other sorting algoritms use local memory with a slow access patterns.
Simple and easy-to-predict program flow allowed the compiler skip memory accesses completely, implementing all of that using the fast registers.
:center-px:

:center-px:

:image-description:
In insertion sort, the compiler uses many instructions for the slow memory unit (LSU - Long Scoreboard Unit, Load/Store to Local Memory), while in Bubble sort, this memory is never used.
Summary
Sorting algorithms are frequently used in GPU shaders for effects like denoising and terrain rendering optimization.
Median filtering, implemented using sorting, is effective at reducing artifacts and flickering in bloom effects and baked textures.
Manual loop unrolling dramatically improves the performance of some algorithms, particularly bubble sort.
Network sort is found to be the fastest for this use case, but unrolled bubble sort is surprisingly competitive due to its simplicity and lack of branching.
The GPU’s architecture (lack of recursion, benefits of unrolling, memory access patterns) heavily influences which sorting algorithm is most efficient.