VRAM bandwidth and its big role in optimization
Aug 12, 2025
10 min
My understanding of rendering performance changed completely when I realized what VRAM bandwidth is and how it affects game performance, especially on mobile devices. Nowadays, tracking memory throughput is one of my standard game profiling routines. In this article, I explain what VRAM bandwidth is, why it matters in rendering optimization, and what to look at in your game to optimize it.
VRAM, what is that?
VRAM is the memory on the GPU where it stores all rendering resources, like textures, render targets, meshes, etc.
:center-100:

:image-description:
This image shows the resources located in GPU VRAM. The heaviest resources are textures, render targets, and meshes.
During rendering, the GPU reads from or writes to VRAM whenever it accesses a resource. The speed of reading from and writing to VRAM is called VRAM Bandwidth.
:center-100:

:image-description:
As shown in this diagram, data for all GPU units is loaded from VRAM through L2 memory. The arrow between L2 and VRAM represents the memory bus. The speed of this bus is called VRAM Bandwidth and is measured in GB/s.
___
When VRAM Bandwidth is used
In Nsight Graphics 2022.7, there is a nice diagram of the GPU memory layout that helps to visualize how VRAM on the GPU operates. Let's look at it and reason about how GPU memory works.
:center-100:

:image-description:
The memory diagram is in Nsight Graphics 2022.7's Range Profiler window. It clearly shows the interconnections of the GPU modules.
In the diagram, you can see that the VRAM is used by the two main GPU modules: input assembler and shader execution.
The input assembler is responsible for fetching vertices and triangles and assembling the data for the vertex shader. It is fixed-function and non-programmable. What matters for the input assembler is the number of vertices in a mesh and the size of each vertex: does it contain normals, UVs, tangents, blend weights?
Shader execution, handled by the Streaming Multiprocessors (SMs), is responsible for executing the shaders. Shaders are programmable and can perform many different operations:
Sampling textures (Texturing units)
Creating and fetching interpolators produced by the vertex shader (Stream Out)
Blending target textures (CROP unit)
Depth testing (ZROP unit)
Each of these operations uses VRAM Bandwidth. Since textures are among the heaviest resources, the highest VRAM bandwidth impact comes from sampling textures, blending target colors, and depth testing. Later in the article, I focus on bandwidth optimization strategies. For now, there is more to understand about memory bandwidth.
:center-100:

:image-description:
This image illustrates which assets, settings, and operations affect VRAM bandwidth the most: sampling textures, object size on screen, mesh size and format, depth testing, screen resolution, and target texture format (LDR/HDR, MSAA enabled).
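To make this concrete, here is a rough mental model of where a frame's VRAM traffic goes. Every number in this sketch is an illustrative assumption, not a measurement; the point is only that each term scales with vertex count, texture resolution, overdraw, or screen resolution:

```python
# Illustrative per-frame VRAM traffic model (all inputs are example assumptions).
width, height = 1920, 1080
pixels = width * height

vertex_traffic  = 2_000_000 * 32         # 2M vertices * 32 B each (position + normal + UV)
texture_traffic = pixels * 3 * 4 * 0.25  # 3 samples/pixel * 4 B/texel * assumed 25% cache-miss rate
color_traffic   = pixels * 4 * 3         # 4 B/pixel * assumed average overdraw of 3
depth_traffic   = pixels * 4 * 3         # 4 B depth read/write per tested fragment

total_gb = (vertex_traffic + texture_traffic + color_traffic + depth_traffic) / 1e9
print(f"~{total_gb:.2f} GB of VRAM traffic per frame")  # ~0.12 GB with these assumptions
```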
___
VRAM Bandwidth of various devices
By now, you probably have a general sense of what affects VRAM bandwidth. But why does it matter so much?
Players use a wide range of devices. Each device has a different screen resolution, refresh rate, LDR/HDR options, and GPU with various VRAM capabilities.
Let's look at some specifications of high-end, low/mid, and mobile devices:
:center-100:

:image-description:
Summary of the VRAM Bandwidth and resolutions of some devices. Notice that mobile devices render at higher resolutions while having lower memory bandwidth.
Notice that players on mobile devices have GPUs with much lower VRAM bandwidth, even though their screens often have higher resolutions. It's an interesting challenge.
Let's calculate how many times we can replace all the pixels on the screen on various devices while keeping their preferred framerate. Assume a single pixel is 4 bytes, 1 byte per color channel.
Math time!
RTX 3060 overdraw capability
:center-100:

:image-description:
Source https://www.nvidia.com/pl-pl/geforce/graphics-cards/30-series/rtx-3060-3060ti/
Full HD (still the most common screen resolution) contains 2,073,600 pixels, which translates to 8.29 MB to store a full screen buffer.
The RTX 3060's memory bandwidth is 320 GB/s. To keep a smooth 60 FPS, we can use at most 1/60 of that per frame, so exceeding 5.33 GB per frame will cause a framerate drop. In reality, it's hard to reach more than 70% efficiency, so let's assume we can't exceed ~3.7 GB per frame.
So how many times can we replace an 8.29 MB buffer with a ~3.7 GB budget?
3.73 GB / 8.29 MB ≈ 450
With ~450x full-screen overdraw, the RTX 3060 should still keep up with a smooth 60 FPS. Impressive.
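If you want to plug in your own device's numbers, here is the same back-of-envelope math as a small Python helper. The 70% efficiency factor is my assumption from above; everything else comes straight from the specs:

```python
def overdraw_per_frame(bandwidth_gbps, pixels, fps, efficiency=0.7, bytes_per_pixel=4):
    """How many full-screen buffer rewrites fit into one frame's bandwidth budget."""
    frame_budget = bandwidth_gbps * 1e9 / fps * efficiency  # usable bytes per frame
    buffer_size = pixels * bytes_per_pixel                  # bytes in one full-screen buffer
    return frame_budget / buffer_size

# RTX 3060: 320 GB/s, Full HD, 60 FPS, 70% efficiency
print(round(overdraw_per_frame(320, 1920 * 1080, 60)))  # -> 450
```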
Quest 3 overdraw capability
:center-100:

:image-description:
Source: https://www.meta.com/pl/quest/quest-3/?srsltid=AfmBOooNtpLcJYYgMsdSb7pkok3ZaXvBpHeFM-vnOa4i2ho5PB2R6Kh3
Let's calculate the overdraw capability of the Quest 3. It needs to render 6,451,200 pixels each frame, which translates to ~25.8 MB.
Assume we need to render at 120 FPS (the standard for Quest 3). Its memory bandwidth is 51.2 GB/s, giving us a ~427 MB budget per frame. With a maximum of 70% efficiency, this drops to ~300 MB per frame.
So how many times can we replace a 25.8 MB buffer with a ~300 MB budget?
300 MB / 25.8 MB ≈ 12
So, on Quest 3 we can only replace the screen contents 12 times before the framerate drops!
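The same helper, fed with the Quest 3 numbers quoted above:

```python
# Quest 3: 51.2 GB/s, 6,451,200 pixels per frame, 120 FPS, 70% efficiency
print(round(overdraw_per_frame(51.2, 6_451_200, 120)))  # -> 12
```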
:center-100:

:image-description:
I fed the numbers into the calc sheet and calculated the theoretically possible overdraw per frame for those devices. Notice that, comparing just memory bandwidths, the Samsung Galaxy S23 Ultra has 10x lower performance than the RTX 3060, and the Quest 2 has ~40x lower performance than the RTX 3060.
As you can see, mobile devices have much less memory bandwidth than PCs. This means you need a different approach to rendering optimization. For low-end or mobile devices, make memory bandwidth your main focus.
BUT: this math experiment is very theoretical. In reality, transparency uses double the bandwidth because it has to read the existing color first. Depth tests read and write the depth buffer. We also sample textures in shaders. Some optimizations are available, like Foveated Rendering (rendering lower resolution at the screen edges), frame generation, and upscaling. A good rule of thumb is not to exceed 50% of the theoretical overdraw.
If you're curious about the new Switch 2's GPU performance, try this: find its resolution, framerate, and VRAM bandwidth, then compare them to the RTX 3060, which runs on the same architecture. 👀
___
How to check if my game is VRAM-bound?
:center-100:

GPUs can struggle with performance for various reasons, but the main bottlenecks are:
VRAM Bandwidth
Computation capabilities
You simply try to render too much
How can you determine whether to focus on memory bandwidth or shader computation?
I created a separate article about GPU profiling basics. In short:
1. Use a GPU profiler provided by the GPU vendor. I mostly work with Nvidia and Intel GPUs, so I use Nvidia Nsight Graphics and Intel GPA.
2. Capture a frame of your game and analyze it in those tools.
3. Look at where the GPU spends the most time. SM units? Your shader is too math-heavy, and you should focus on reducing shader complexity. Texturing units? Color blending? Depth testing? Then focus on optimizing memory bandwidth, not the shader code itself.
:center-100:

:image-description:
Here is my article about profiling the GPU: https://www.proceduralpixels.com/blog/how-to-profile-the-rendering-gpu-profiling-basics
Let's look at the two scenarios I profiled using Nvidia Nsight.
:center-100:

:image-description:
In the first scenario, the GPU spent the most time in the Screen Pipe - blending the colors. In the second scenario, the bottleneck was created by SM units, processing the shader code.
___
How to profile mobile phones?
In my experience, mobile GPU profilers often crash or don't give useful data. To approximate mobile performance on a PC, I use Intel GPA with the Unity editor and run the game on an integrated Intel GPU. Since integrated GPUs use system RAM, their memory bandwidth is similar to that of mobile devices. I analyze the GPU trace and use it as a guide for mobile optimization.
___
VRAM bandwidth optimization techniques
Here, I will briefly cover VRAM bandwidth optimization techniques.
As discussed earlier, VRAM bandwidth is mostly used by sampling textures, blending target textures, depth testing, and reading meshes. I will quickly go through each and explain how to optimize them.
Sampling textures
When you sample a texture for the first time, the GPU loads its contents from VRAM to L2 (faster memory). The texturing units then load samples from L2 and provide them to the shader execution units.
If you sample a texture many times, the texturing units first check whether the texture data is already present in their own memory (L1), then in L2, and only then fall back to VRAM reads.
When I optimize for memory bandwidth, this is what I keep in mind:
✔️ Higher texture resolution = more cache misses = more VRAM bandwidth used
✔️ Sampling fewer textures = less VRAM bandwidth used
✔️ Mipmaps help cache efficiency - they lower the sampled texture resolution when objects appear smaller on screen. Usually, enabling mipmaps reduces VRAM bandwidth usage.
✔️ Sampling textures using random UVs, triplanar mapping, or marching through them is slower than standard model texturing. These techniques make the L1 and L2 caches inefficient, so VRAM reads are more frequent.
:center-100:

:image-description:
I created a decal shader with a parallax effect that used several samples to traverse the heightmap in the fragment shader. I tested a 2K heightmap and a 256x256 heightmap. The rendering result was similar, but the lower resolution texture rendered almost 3x faster.
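A quick footprint comparison shows why the smaller heightmap won. Assuming a single-channel 8-bit heightmap (the exact format in my test may have differed), the 2K texture is far too big to stay in cache while the shader marches through it:

```python
def texture_kb(size, bytes_per_texel=1):
    """Raw size of a square texture in KB (mipmaps excluded)."""
    return size * size * bytes_per_texel / 1024

print(texture_kb(2048))  # 4096.0 KB - far bigger than a typical L2 cache, so reads fall back to VRAM
print(texture_kb(256))   # 64.0 KB   - small enough to stay resident in cache
```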
Color blending
Each time you render a mesh, it replaces some of the pixels on the screen.
When I optimize the VRAM bandwidth spent on color blending, I keep these things in mind (see the sketch after the list):
✔️ Transparent objects use 2x more memory bandwidth than opaque ones. They need to read the target color first, blend it, and then write it.
✔️ Deferred rendering uses much more memory bandwidth because it renders opaque objects into multiple textures at once.
✔️ Rendering opaque objects from front to back is more efficient, because depth testing discards more pixels before they are shaded.
✔️ Lowering the target resolution decreases the number of rendered pixels. When you lower the resolution from 1920x1080 to 1280x720, the pixel count drops by more than half!
✔️ Change the target color format to LDR. Switching from R16G16B16A16 to R8G8B8A8 halves the memory bandwidth used.
✔️ Lower the size of objects on the screen: reduce the number of transparent objects, decrease particle count and size, reduce the layering of UI images, etc.
✔️ Lower the number of postprocesses. Each postprocess adds a full-screen read-write overdraw!
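Here is a minimal sketch of how I'd estimate blending traffic from the points above. The overdraw counts are example assumptions:

```python
def blend_traffic_mb(pixels, bytes_per_pixel, opaque_overdraw, transparent_overdraw):
    """Rough color-blending traffic: opaque layers are written once,
    transparent layers are read and then written (2x the traffic)."""
    opaque = pixels * bytes_per_pixel * opaque_overdraw
    transparent = pixels * bytes_per_pixel * 2 * transparent_overdraw
    return (opaque + transparent) / 1e6

pixels = 1920 * 1080
print(blend_traffic_mb(pixels, 4, 2, 3))  # R8G8B8A8 (LDR):     ~66 MB per frame
print(blend_traffic_mb(pixels, 8, 2, 3))  # R16G16B16A16 (HDR): ~133 MB - exactly 2x more
```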
:center-100:

:image-description:
I rendered the same frame at two different resolutions. Lowering the target resolution made the frame render faster: 48% fewer pixels on the screen, 31% faster frame render. The speedup is not linear because some parts of frame rendering are not resolution-dependent, such as shadow maps.
Depth testing
Opaque meshes and shadow rendering use depth testing to determine if rendered objects are in front of those already rendered.
Here is what I keep in mind when optimizing depth testing (see the sketch after the list):
✔️ Lowering the precision of the depth buffer will lower the used memory bandwidth.
✔️ Rendering objects front to back means that there are fewer depth writes.
✔️ Disabling shadows on additional lights and lowering the shadow resolution will increase the bandwidth efficiency.
✔️ Occlusion culling disables the rendering of culled objects, so it reduces the depth reads required for depth testing.
✔️ Depth prepass will increase the depth overdraw, but it will reduce the color overdraw.
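And a minimal sketch of the same kind of estimate for depth traffic. Real GPUs compress the depth buffer and use hierarchical tests, so treat this only as a relative comparison; the overdraw and write-ratio values are my assumptions:

```python
def depth_traffic_mb(pixels, bytes_per_sample, overdraw, write_ratio):
    """Every tested fragment reads depth; only fragments that pass the test write it."""
    reads = pixels * overdraw * bytes_per_sample
    writes = pixels * overdraw * write_ratio * bytes_per_sample
    return (reads + writes) / 1e6

pixels = 1920 * 1080
print(depth_traffic_mb(pixels, 4, 3, 0.8))  # 32-bit depth, mostly back-to-front: ~45 MB
print(depth_traffic_mb(pixels, 2, 3, 0.3))  # 16-bit depth, front-to-back:        ~16 MB
```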
:center-100:

:image-description:
This image shows how reducing VRAM bandwidth from shadowmap rendering can speed up render time. When optimizing for one bottleneck, others can arise. Lowering shadowmap resolution increases stress on the World Pipe, which handles vertex fetching and triangle rasterization. To optimize further, I could simplify the models or decrease the LOD switch distance.
Other techniques
There are also other techniques that minimize memory bandwidth use. Those include:
✔️ Using procedural textures instead of sampled ones (usually worth it only for really simple textures, like value noise)
✔️ Rendering at a lower resolution and upscaling, e.g., DLSS/FSR. An important caveat is that the render pipeline needs to be adjusted for these, and upscalers usually require rendering additional motion vector textures, which still makes them a poor fit for weak mobile devices.
✔️ Frame generation - rendering at a lower framerate and letting the GPU extrapolate missing frames from previous ones. Available on Quest 3.
✔️ Foveated Rendering - rendering full resolution at the player's eye-focus point and lower resolution around it. Available on Quest 3.
✔️ Variable rate shading - available on newer GPUs - executes fewer fragment shader invocations inside a triangle. Triangle edges stay sharp, but the interior is shaded at a lower resolution. Similar to Foveated Rendering, but developers get full control over screen tiles and their shading rate.
✔️ Various GPU architectures handle VRAM differently - tiled GPU architectures do far more work in on-chip memory instead of VRAM, which increases their effective memory bandwidth. Tiled architectures are common on mobile GPUs. So it's best to profile first - then optimize.
___
Summary
VRAM bandwidth determines how fast a graphics card can access game data like textures, render targets, and meshes.
Mobile devices are an order of magnitude slower than PCs when comparing VRAM bandwidth.
When optimizing rendering for mobile, focus primarily on VRAM bandwidth.
Native GPU profilers can help you check if your game is limited by VRAM bandwidth.
Optimize it by lowering texture quality, using fewer transparent objects, reducing render resolution, changing render target formats, etc.
___
:center-100:

Unfortunately, I rarely get feedback about my articles, even though I know that ~500 developers read them monthly! If you happen to be on LinkedIn, consider leaving feedback in a comment, sparking a discussion, or leaving a reaction - I will really appreciate it:
If you want to learn about GPU profiling basics, check this article:
https://www.proceduralpixels.com/blog/how-to-profile-the-rendering-gpu-profiling-basics