Rendering 90,000 particles in Unity: what actually costs CPU and GPU time
Jun 4, 2026
15 min

Guide to instancing in Unity 6.3
I render objects. A lot of objects.
What is the best way to render so many objects in Unity efficiently? Here I will show you the most common methods of rendering many objects in Unity 6 with URP.
I will explain the particle implementation and then render the same scene using these methods:
GameObjects
GameObjects with GPU Resident Drawer
Graphics.RenderMesh
Graphics.RenderMeshInstanced
Graphics.RenderMeshPrimitives
Batch Renderer Group
Let's see which rendering method is the fastest on the CPU and GPU, and which one is the easiest to implement.
___
Particle Simulation
Let's start by rendering something nice. I created particles that are simulated on the CPU. Each particle has position, velocity, and size.
I prepared some parameters for the simulation, where I can control the particle behaviour:
This is how all the particles look when initialized, rendered as sphere gizmos for each particle.

Then I added simulation with:
Gravity
Noise-based velocity modulation
A force that pulls the particles towards the center
Air drag
Movement according to velocity
Bouncing off the ground
I implemented that using a multithreaded job with the Burst compiler:
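A minimal sketch of what such a job could look like is below. It is an illustration, not the original implementation: the Particle layout, all parameter names, the simplex-noise sampling, and the ground plane at y = 0 are assumptions.

```csharp
using Unity.Burst;
using Unity.Collections;
using Unity.Jobs;
using Unity.Mathematics;

// Hypothetical layout: the article only states position, velocity and size.
public struct Particle
{
    public float3 position;
    public float3 velocity;
    public float size;
}

[BurstCompile]
public struct SimulateParticlesJob : IJobParallelFor
{
    public NativeArray<Particle> particles;

    // All parameters below are assumed names for illustration.
    public float deltaTime;
    public float time;
    public float3 gravity;
    public float3 center;
    public float centerPull;
    public float noiseStrength;
    public float noiseFrequency;
    public float drag;
    public float bounciness;

    public void Execute(int index)
    {
        Particle p = particles[index];

        // Gravity.
        p.velocity += gravity * deltaTime;

        // Noise-based velocity modulation: three offset simplex-noise samples.
        float3 s = p.position * noiseFrequency + time;
        float3 n = new float3(noise.snoise(s), noise.snoise(s + 17f), noise.snoise(s + 43f));
        p.velocity += n * noiseStrength * deltaTime;

        // Force pulling the particle towards the center.
        p.velocity += (center - p.position) * centerPull * deltaTime;

        // Air drag as simple linear damping.
        p.velocity *= math.saturate(1f - drag * deltaTime);

        // Movement according to velocity.
        p.position += p.velocity * deltaTime;

        // Bounce off the ground (assumed to be the plane y = 0).
        float radius = p.size * 0.5f;
        if (p.position.y < radius)
        {
            p.position.y = radius;
            p.velocity.y = math.abs(p.velocity.y) * bounciness;
        }

        particles[index] = p;
    }
}
```

Such a job would typically be scheduled with something like job.Schedule(particles.Length, 64) and completed before the particle data is read for rendering.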
And this is how it looks for 5000 particles, rendered using Gizmos:
For 5000 particles, the particle calculation takes ~0.16 ms on an i5-10400F:

For 90000 particles, the median time is 1.81 ms.

___
Profiling
In the next section I will test different methods of rendering those 90000 particles. For each method I will measure the impact on CPU and GPU.
My profiling setup for every measurement is:
CPU: i5-10400F
GPU: RTX 3060 12 GB
RAM: DDR4 64 GB 1333 MHz
Resolution: 1600x900
Unity: 6000.3.11f1 (Unity 6), URP, Windows Build IL2CPP Master, debugging and deep profiling disabled, DX12
___
1. GameObjects
Let's start with the slowest rendering method. I will create a GameObject for each particle and animate its transform.
So I created this GameObject as a prefab:

And at the start of the simulation, I create one instance of this GameObject for each particle.
Then in each frame I modify the transform of each object to match the particle position.
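As a rough sketch of this approach (class, field, and method names are hypothetical, and the particle data is assumed to be exposed as separate position and size arrays):

```csharp
using Unity.Collections;
using UnityEngine;

public class GameObjectParticleRenderer : MonoBehaviour
{
    [SerializeField] GameObject particlePrefab; // hypothetical prefab reference
    Transform[] instances;

    // Called once at the start of the simulation.
    public void Spawn(int count)
    {
        instances = new Transform[count];
        for (int i = 0; i < count; i++)
            instances[i] = Instantiate(particlePrefab).transform;
    }

    // Called every frame after the simulation step.
    public void Sync(NativeArray<Vector3> positions, NativeArray<float> sizes)
    {
        for (int i = 0; i < instances.Length; i++)
        {
            instances[i].position = positions[i];
            instances[i].localScale = Vector3.one * sizes[i];
        }
    }
}
```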
GameObjects: CPU performance
For now I have the SRP Batcher enabled and GPU Resident Drawer disabled.
Notice how all the threads are busy all the time, preparing the batches. That leaves fewer threads available for the particle calculation, so the simulation takes a bit longer.

Frame median for the main thread: 81.76 ms
ParticleSimulation.Update: 22.66 ms
UpdateAllRenderers: 4.43 ms
Waiting for render thread jobs: 20.75 ms
FinishFrameRendering: 31.93 ms
Bottleneck: main thread and render thread
GameObjects: GPU performance
The GPU looks much better than the CPU.
GPU frame time: 8.56 ms

On the GPU there is a clear bottleneck on communication between CPU and GPU. There is too much happening between draw calls.

Also, each particle is rendered with a separate draw call for each shadowmap cascade and color pass. The draw call count here is crazy: 270 010 draw calls per frame.
___
2. GameObjects with GPU Resident Drawer
The SRP Batcher in Unity can't do instanced rendering on the GPU. That means every renderer creates a new draw call.
The GPU can render the same mesh many times in one draw call when using instanced rendering. That is useful when you need to render many objects that use the same mesh and material, which is the case here.
GPU Resident Drawer is a new rendering method in Unity that is focused on instanced rendering. It finds which objects in the scene use the same mesh and material, and then renders them using instanced rendering.
It greatly reduces the time needed for individual draw calls, but adds some time to figure out which objects can be batched. In this case it should be beneficial.
GPU Resident Drawer can be enabled in the Universal Render Pipeline Asset:
GPU Resident Drawer: CPU performance
So I just enabled GPU Resident Drawer. Below you can see that the frame time dropped greatly from 81.76 ms to 25.79 ms.
There is much more free time on other threads.

Frame median for the main thread: 24.88 ms
ParticleSimulation.Update: 9.78 ms
UpdateAllRenderers: 3.77 ms
GPU Resident Drawer: 8.30 ms
FinishFrameRendering: 2.52 ms
Bottleneck: main thread
Interestingly, setting transform position for each object got 2-3x faster when GPU Resident Drawer was enabled. The only difference between the two sessions is that all other worker threads were free when GPU Resident Drawer was enabled. Maybe in the first case, other threads were locking the hierarchy data for some reason.
Now, one of the slowest parts of the rendering is setting the transforms for the particles. One optimization I can do in this case is to use Transform Access Array (TransformAccessArray) together with an IJobParallelForTransform job, which can be Burst-compiled and sets transform values much faster. You can find more about that in the documentation, where there is a nice example of how to use it to speed up the transform operations:
https://docs.unity3d.com/6000.4/Documentation/ScriptReference/Jobs.IJobParallelForTransform.html
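The pattern can be sketched roughly like this. The job and class names are hypothetical, and the particle data is assumed to be available as position and size arrays:

```csharp
using Unity.Burst;
using Unity.Collections;
using Unity.Jobs;
using UnityEngine;
using UnityEngine.Jobs;

[BurstCompile]
public struct ApplyParticleTransformsJob : IJobParallelForTransform
{
    [ReadOnly] public NativeArray<Vector3> positions;
    [ReadOnly] public NativeArray<float> sizes;

    // Called once per transform registered in the TransformAccessArray.
    public void Execute(int index, TransformAccess transform)
    {
        transform.position = positions[index];
        float s = sizes[index];
        transform.localScale = new Vector3(s, s, s);
    }
}

public class TransformJobRunner : MonoBehaviour
{
    TransformAccessArray transformAccess;

    // instances[i] is assumed to be the GameObject transform of particle i.
    public void Init(Transform[] instances)
    {
        transformAccess = new TransformAccessArray(instances);
    }

    public void Sync(NativeArray<Vector3> positions, NativeArray<float> sizes)
    {
        var job = new ApplyParticleTransformsJob { positions = positions, sizes = sizes };
        job.Schedule(transformAccess).Complete();
    }

    void OnDestroy()
    {
        if (transformAccess.isCreated) transformAccess.Dispose();
    }
}
```

One design note: transform jobs are split across threads per hierarchy root, so keeping the particle objects as root-level objects (rather than children of a single parent) should parallelize better.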
I implemented that for the particles, and these are the results:

Frame median for the main thread: 18.91 ms
ParticleSimulation.Update: 3.41 ms
UpdateAllRenderers: 4.13 ms
GPU Resident Drawer: 8.24 ms
FinishFrameRendering: 2.61 ms
Using Transform Access Array saved 6.37 ms on particle update time compared to GPU Resident Drawer without it (9.78 ms → 3.41 ms). Overall main thread time dropped from 24.88 ms to 18.91 ms, saving 5.97 ms.
Bottleneck: main thread
GPU Resident Drawer: GPU performance
Now let's find out what the rendering performance of GPU Resident Drawer on the GPU is.

GPU frame time: 5.05 ms, where:
1.96 ms - sending instance data to the GPU
3.10 ms - render time
Compared to plain GameObjects without GPU Resident Drawer, GPU frame time dropped from 8.56 ms to 5.05 ms, saving 3.51 ms.
Quick note: GPU Resident Drawer is not faster in every case!
In my article about occlusion culling it was actually slower.
___
3. Graphics.RenderMesh
Now let's see what happens if I don't use GameObjects at all, but invoke all the draws manually from C# code.
I will iterate through each particle and use Graphics.RenderMesh to invoke the draw call for each instance. Behind the scenes it works like creating an immediate-mode renderer in the scene with one-frame lifetime.
So, here is the code I added to the Update method:
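A minimal sketch of such a loop, not the original listing (class and field names are hypothetical, and the particle data is assumed to be available as position and size arrays):

```csharp
using Unity.Collections;
using UnityEngine;
using UnityEngine.Rendering;

public class RenderMeshPerParticle : MonoBehaviour
{
    [SerializeField] Mesh mesh;
    [SerializeField] Material material;

    // Called from Update() after the simulation step.
    public void Draw(NativeArray<Vector3> positions, NativeArray<float> sizes)
    {
        var rp = new RenderParams(material)
        {
            shadowCastingMode = ShadowCastingMode.On,
            receiveShadows = true
        };

        // One Graphics.RenderMesh call per particle.
        for (int i = 0; i < positions.Length; i++)
        {
            Matrix4x4 objectToWorld = Matrix4x4.TRS(
                positions[i], Quaternion.identity, Vector3.one * sizes[i]);
            Graphics.RenderMesh(rp, mesh, 0, objectToWorld);
        }
    }
}
```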
Graphics.RenderMesh: CPU performance
Well... it doesn't look good. This way of rendering creates a huge problem for the main thread and render thread at the same time.

Frame median for the main thread: 92.53 ms
ParticleSimulation.Update: 28.38 ms
FinishFrameRendering: 51.77 ms
Clear immediate renderers: 10.94 ms
Bottleneck: main thread
Graphics.RenderMesh: GPU performance
GPU performance also doesn't look good. Because each draw call is separate and invoked from code, Unity can't batch anything.
The bottleneck is on CPU-GPU communication, with very low throughput.
There are too many state switches between the 270 010 draw calls.
Measured GPU frame time was 31.34 ms.


___
4. Graphics.RenderMeshInstanced
Now let's look at instancing. There are a few ways to use instancing in Unity. For now I will try Graphics.RenderMeshInstanced. This method allows me to render one mesh multiple times by providing an array of local-to-world matrices.
The limitation of this method is that it uses a uniform matrix array in the shader, and uniform values have a limited max size. It can only take a max of 1023 matrices at a time. That means for 90k particles I will still need 88 instanced draws: 87 with 1023 particles each and one with the remaining 999.
The way it works is that Unity creates an immediate-mode renderer for each draw issued.
The code looks like this:
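A minimal sketch of the batching loop, not the original listing. The names are hypothetical, and the matrices are assumed to be computed already and stored in a NativeArray:

```csharp
using Unity.Collections;
using UnityEngine;

public class InstancedParticleRenderer : MonoBehaviour
{
    const int MaxInstancesPerCall = 1023;

    [SerializeField] Mesh mesh;
    [SerializeField] Material material; // needs "Enable GPU Instancing" ticked

    // Called from Update() once the local-to-world matrices are ready.
    public void Draw(NativeArray<Matrix4x4> matrices)
    {
        var rp = new RenderParams(material);

        // Submit the array in slices of at most 1023 instances.
        for (int start = 0; start < matrices.Length; start += MaxInstancesPerCall)
        {
            int count = Mathf.Min(MaxInstancesPerCall, matrices.Length - start);
            Graphics.RenderMeshInstanced(rp, mesh, 0, matrices, count, start);
        }
    }
}
```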
Note about further optimization
The above code could be optimized further. I could:
Simulate the particles in Update().
Schedule this job without calling Complete(), but call JobHandle.ScheduleBatchedJobs() to ensure the job starts immediately without blocking the main thread.
In LateUpdate(), call Complete() on the job handle and render the particles using the Graphics API.
In this way, the matrix calculation will happen between Update() and LateUpdate() on another thread. So it is free as long as worker threads are free.
Anyway, I don't have any gameplay logic here, so this optimization would not change the measurements much.
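The Update()/LateUpdate() split described above could be sketched as follows. This is an illustration under assumptions: the names are hypothetical, and the simulation job that would fill the position and size arrays is left out.

```csharp
using Unity.Burst;
using Unity.Collections;
using Unity.Jobs;
using UnityEngine;

[BurstCompile]
public struct BuildMatricesJob : IJobParallelFor
{
    [ReadOnly] public NativeArray<Vector3> positions;
    [ReadOnly] public NativeArray<float> sizes;
    [WriteOnly] public NativeArray<Matrix4x4> matrices;

    public void Execute(int i)
    {
        Vector3 p = positions[i];
        float s = sizes[i];
        // Columns of a translation * uniform-scale matrix.
        matrices[i] = new Matrix4x4(
            new Vector4(s, 0, 0, 0),
            new Vector4(0, s, 0, 0),
            new Vector4(0, 0, s, 0),
            new Vector4(p.x, p.y, p.z, 1));
    }
}

public class DeferredInstancedRenderer : MonoBehaviour
{
    const int MaxInstancesPerCall = 1023;

    [SerializeField] Mesh mesh;
    [SerializeField] Material material; // needs "Enable GPU Instancing" ticked
    [SerializeField] int count = 90000;

    NativeArray<Vector3> positions;
    NativeArray<float> sizes;
    NativeArray<Matrix4x4> matrices;
    JobHandle handle;

    void OnEnable()
    {
        positions = new NativeArray<Vector3>(count, Allocator.Persistent);
        sizes = new NativeArray<float>(count, Allocator.Persistent);
        matrices = new NativeArray<Matrix4x4>(count, Allocator.Persistent);
    }

    void Update()
    {
        // A simulation job would be scheduled first and passed in as a dependency.
        var job = new BuildMatricesJob { positions = positions, sizes = sizes, matrices = matrices };
        handle = job.Schedule(matrices.Length, 64);

        // Kick the worker threads now instead of waiting for Complete().
        JobHandle.ScheduleBatchedJobs();
    }

    void LateUpdate()
    {
        // Blocks only if the workers have not finished yet.
        handle.Complete();

        var rp = new RenderParams(material);
        for (int start = 0; start < matrices.Length; start += MaxInstancesPerCall)
        {
            int n = Mathf.Min(MaxInstancesPerCall, matrices.Length - start);
            Graphics.RenderMeshInstanced(rp, mesh, 0, matrices, n, start);
        }
    }

    void OnDisable()
    {
        handle.Complete();
        positions.Dispose();
        sizes.Dispose();
        matrices.Dispose();
    }
}
```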
Adjusting the shader
To make the shader support rendering 1023 instances at a time, according to the documentation, I need to add #pragma instancing_options assumeuniformscaling to the shader. Otherwise it will be able to render just 511 objects at a time, as it will need to use inverse-model matrices for normal transformations.
I created a custom Shader Graph with a custom function node with this pragma and used it in rendering. Now this shader should be able to render 1023 instances in one draw call.

Let's check the rendering performance.
Graphics.RenderMeshInstanced: CPU performance
Most of the CPU time on the main thread is on preparing the batches (3 ms) and on waiting for the GPU to finish the rendering.

However, the render thread becomes the problem here, because preparing those batches becomes quite costly for the render thread. Here you can see how much time it spent filling in the data and invoking the draw calls:

Notice how shadow rendering takes only 2 threads and opaque rendering only one thread. Unity wasn't able to prepare the batches using more than 3 threads at a time.
Also, it makes some of the threads unavailable for particle simulation and matrix calculation, so it also slows down the particles.

Median CPU time was 9.92 ms:
ParticleSimulation.Update: 4.44 ms
FinishFrameRendering: 0.93 ms
Wait for last present: 3.52 ms (actually caused by slow render thread rather than slow rendering on the GPU)
Bottleneck: render thread
Graphics.RenderMeshInstanced: GPU performance
The GPU looks a bit better. I measured a GPU frame time of 4.78 ms.

However, the slow render thread caused the GPU to wait between the frames. Notice the empty spaces between the frames. This is the time where the GPU was waiting for the CPU to submit the command lists.

And the performance issue is on PCIe communication with the CPU.

___
5. Graphics.RenderMeshPrimitives
Now let's look at the instancing method that is not restricted by the max object count: Graphics.RenderMeshPrimitives.
This method allows me to render any number of instances in one draw call. There is one drawback: I need to figure out how to feed the transformation data for each instance on my own. So it is not natively supported by URP shaders out of the box.
I store all the particles in a NativeArray already. I could just send this data directly to the GPU. Then I could use it directly in the shader to move the particles.
C# implementation
At the start of particle simulation, I will allocate a graphics buffer for the particles data. Graphics Buffer here represents the "array" of data that lives on the GPU.
In each frame, after the simulation is finished, I will sync the state of the particles with this GPU buffer. Then I will just render the particles using Graphics.RenderMeshPrimitives. Unity will take care of invoking the draw calls for shadowmap and opaques.
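A minimal sketch of the C# side, not the original listing. The ParticleData field layout, the class name, and the world bounds are assumptions; the material is assumed to use a shader that reads the _Particles buffer.

```csharp
using System.Runtime.InteropServices;
using Unity.Collections;
using UnityEngine;
using UnityEngine.Rendering;

// Assumed layout (28 bytes); it must match the struct declared in the shader.
[StructLayout(LayoutKind.Sequential)]
public struct ParticleData
{
    public Vector3 position;
    public Vector3 velocity;
    public float size;
}

public class ProceduralParticleRenderer : MonoBehaviour
{
    static readonly int ParticlesId = Shader.PropertyToID("_Particles");

    [SerializeField] Mesh mesh;
    [SerializeField] Material material;

    GraphicsBuffer particleBuffer;

    // Called once at the start of the simulation.
    public void Init(int count)
    {
        particleBuffer = new GraphicsBuffer(
            GraphicsBuffer.Target.Structured, count, Marshal.SizeOf<ParticleData>());
        material.SetBuffer(ParticlesId, particleBuffer);
    }

    // Called every frame after the simulation job has completed.
    public void Draw(NativeArray<ParticleData> particles)
    {
        // Sync the CPU-side particle state with the GPU buffer.
        particleBuffer.SetData(particles);

        var rp = new RenderParams(material)
        {
            // The object-to-world matrix is identity here, so culling relies
            // on these bounds. The size is an arbitrary assumption.
            worldBounds = new Bounds(Vector3.zero, Vector3.one * 1000f),
            shadowCastingMode = ShadowCastingMode.On,
            receiveShadows = true
        };

        // One call renders all instances.
        Graphics.RenderMeshPrimitives(rp, mesh, 0, particles.Length);
    }

    void OnDestroy()
    {
        particleBuffer?.Dispose();
    }
}
```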
Shader for procedural instancing
Now the data should be accessible in the shader code under StructuredBuffer<ParticleData> _Particles, so it is time to create a shader for it.
I decided to use Shader Graph for it. My goal is to modify the position for each particle.
I need to access the InstanceID to access the index of the rendered particle.
I will use a custom function node to implement the position offset for each particle.

When procedural instancing is used, the local-to-world matrix is set to identity. So object space is the same as the world space.
The custom node I created uses the HLSL code that implements the instancing.

And this is the HLSL implementation. Explanations are in the code's comments:
And that's it! Time to create a material that uses this shader and use it to render the particles.
Graphics.RenderMeshPrimitives: CPU performance
This is the most efficient method of rendering. Notice that the highest overhead of this method is on sending the particle data from CPU to GPU.
This rendering method needs to copy the particle data for the render thread. Then the render thread sends the data to the GPU.
Median time copying the data was 0.263 ms

On the render thread, the data was sent to the GPU using BufferD3D12.UpdateInternal:

And it matches the GPU time reported by NVIDIA Nsight (it is not the same frame).

Median frame time for the CPU was 3.40 ms:
Particle update: 2.12 ms (simulation: 1.84 ms, invoking the draw call: 0.26 ms)
FinishFrameRendering: 0.72 ms
Wait for present: 0.24 ms
Bottleneck: GPU, but both CPU and GPU were quite equal in this scenario.
Graphics.RenderMeshPrimitives: GPU performance
GPU frame time was ~3.02 ms, with ~0.25 ms of waiting between the frames.

Bottleneck is not on the PCIe. All the throughputs are low. That means the rendered particles are too small on the screen, with triangles that are too small to render efficiently. So to optimize the GPU further, I would need to change the mesh for the particles (for example, render quad impostors instead of full cube meshes). I could also play with the mesh format and try to render many particles within a single mesh to see if the workload distribution on this GPU would get better.

And I can see that for the whole frame only 12 draw calls started:
Also in NVIDIA Nsight I can see that the async queue is used to send the particle data to the GPU and it happens during the rendering.
This data copy process increases the VRAM usage from 1% to 4%.


___
6. Batch Renderer Group
And finally, there is one more method of instanced rendering in Unity: Batch Renderer Group.
Batch Renderer Group was created for the DOTS tech stack to work with ECS. However, it is possible to use it outside of ECS. The good thing about BRG is that it supports rendering with URP shaders.
The concept is very similar to procedural instancing: send the data to the GPU once and render all instances at once, reading properties of each instance from a GPU buffer. But Batch Renderer Group takes a completely different path for the implementation.
You do not issue a draw call from Update(). Instead you:
Register a batch with Unity,
Keep a GPU buffer of per-instance transforms up to date,
Answer a culling callback when the engine is ready to render, and invoke the render command for each instance from a Burst-compatible multithreaded job.
All of that allows you to:
cull the draws using a multithreaded Burst-compiled job,
invoke draw calls with different meshes and different materials, as long as they use the same per-instance data.
But this flexibility comes with the cost of massive boilerplate code. Let me show you how it works.
BRG - implementation in short
Initializing the BRG
At setup, I created a BatchRendererGroup, registered my mesh and material, and allocated a raw GraphicsBuffer that holds one transform pair per particle (unity_ObjectToWorld and unity_WorldToObject). The snippets below run once, to initialize the BRG.
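A minimal sketch of this setup step, under assumptions: the class and field names are hypothetical, the 96-byte padding follows the layout used in Unity's BRG sample, and the culling callback is only a stub that submits no draw commands.

```csharp
using System;
using Unity.Jobs;
using UnityEngine;
using UnityEngine.Rendering;

public class BrgSetup : MonoBehaviour
{
    [SerializeField] Mesh mesh;
    [SerializeField] Material material;
    [SerializeField] int instanceCount = 90000;

    BatchRendererGroup brg;
    BatchMeshID meshId;
    BatchMaterialID materialId;
    GraphicsBuffer instanceData;

    void Start()
    {
        // Unity calls OnPerformCulling when it is ready to render.
        brg = new BatchRendererGroup(OnPerformCulling, IntPtr.Zero);
        meshId = brg.RegisterMesh(mesh);
        materialId = brg.RegisterMaterial(material);

        const int bytesPerPackedMatrix = 12 * sizeof(float); // packed 3x4 matrix
        const int paddingBytes = 96;                         // assumed padding size
        int totalBytes = paddingBytes + instanceCount * 2 * bytesPerPackedMatrix;

        // Raw buffer addressed in 4-byte words.
        instanceData = new GraphicsBuffer(
            GraphicsBuffer.Target.Raw, totalBytes / sizeof(int), sizeof(int));
    }

    // Stub only: a real callback fills cullingOutput with draw commands.
    JobHandle OnPerformCulling(
        BatchRendererGroup rendererGroup,
        BatchCullingContext cullingContext,
        BatchCullingOutput cullingOutput,
        IntPtr userContext)
    {
        return new JobHandle();
    }

    void OnDestroy()
    {
        brg?.Dispose();
        instanceData?.Dispose();
    }
}
```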
BRG expects transforms in a packed 3×4 layout (12 floats per matrix, not a full 4×4). The buffer layout is: padding, then the objectToWorld array, then the worldToObject array. I need to calculate the byte offset of each section in this buffer:
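The offset arithmetic can be sketched in plain C#. The type name is hypothetical, and the 96-byte padding is an assumption borrowed from Unity's BRG sample (64 zeroed bytes plus 32 spare bytes, so that each section starts at a multiple of the 48-byte matrix size):

```csharp
using System;

public readonly struct BrgBufferLayout
{
    public const int BytesPerPackedMatrix = 12 * sizeof(float); // 48 bytes
    public const int PaddingBytes = 96;                         // assumed padding size

    public readonly int ObjectToWorldOffset;
    public readonly int WorldToObjectOffset;
    public readonly int TotalBytes;

    public BrgBufferLayout(int instanceCount)
    {
        if (instanceCount < 0)
            throw new ArgumentOutOfRangeException(nameof(instanceCount));

        // Layout: padding | objectToWorld[instanceCount] | worldToObject[instanceCount]
        ObjectToWorldOffset = PaddingBytes;
        WorldToObjectOffset = ObjectToWorldOffset + instanceCount * BytesPerPackedMatrix;
        TotalBytes = WorldToObjectOffset + instanceCount * BytesPerPackedMatrix;
    }
}
```

For 90 000 instances this gives an objectToWorld offset of 96 bytes, a worldToObject offset of 4 320 096 bytes, and a total buffer size of 8 640 096 bytes.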
The instance data can be packed in many ways, so the URP shaders need to know where to find those matrices. I need to tell Unity how to read the data in the shader. This is what metadata is for:
Now I can add one instanced batch with the BRG. It tells Unity: "Here is one batch of instances. This is how to read their data, and this is the GPU buffer that holds it."
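A minimal sketch of the metadata and batch registration, not the original listing. The helper name is hypothetical and the byte offsets are assumed to be computed beforehand:

```csharp
using Unity.Collections;
using UnityEngine;
using UnityEngine.Rendering;

public static class BrgBatchUtil
{
    // High bit set: the property is an array with one value per instance.
    const uint PerInstanceFlag = 0x80000000;

    public static BatchID AddParticleBatch(
        BatchRendererGroup brg,
        GraphicsBuffer instanceData,
        uint objectToWorldByteOffset,
        uint worldToObjectByteOffset)
    {
        var metadata = new NativeArray<MetadataValue>(2, Allocator.Temp);

        // Each entry maps a shader property to a byte offset in the buffer.
        metadata[0] = new MetadataValue
        {
            NameID = Shader.PropertyToID("unity_ObjectToWorld"),
            Value = PerInstanceFlag | objectToWorldByteOffset
        };
        metadata[1] = new MetadataValue
        {
            NameID = Shader.PropertyToID("unity_WorldToObject"),
            Value = PerInstanceFlag | worldToObjectByteOffset
        };

        // Register the batch: how to read the data, and which buffer holds it.
        BatchID batchId = brg.AddBatch(metadata, instanceData.bufferHandle);
        metadata.Dispose();
        return batchId;
    }
}
```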
Now the batch of instances is properly registered by the engine. This was just the initialization. Now it is time to render!
Updating the instances
Now my responsibility is to keep the instance data up to date. I will do this in the Update method.
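A rough sketch of such an update, under assumptions: particles have a position and a uniform size greater than zero, no rotation, and the type and method names are hypothetical. It runs on the main thread for brevity; the same math could be moved into a Burst job.

```csharp
using Unity.Collections;
using UnityEngine;

// Packed 3x4 matrix: four columns of three floats (12 floats, 48 bytes).
public struct PackedMatrix
{
    public float c0x, c0y, c0z;
    public float c1x, c1y, c1z;
    public float c2x, c2y, c2z;
    public float c3x, c3y, c3z;

    // Uniform scale followed by translation.
    public static PackedMatrix TranslateScale(Vector3 t, float s) => new PackedMatrix
    {
        c0x = s, c1y = s, c2z = s,
        c3x = t.x, c3y = t.y, c3z = t.z
    };
}

public static class BrgInstanceUpload
{
    const int BytesPerPackedMatrix = 48;

    // objectToWorld and worldToObject are caller-owned scratch arrays,
    // each at least as long as positions.
    public static void Upload(
        GraphicsBuffer instanceData,
        NativeArray<Vector3> positions,
        NativeArray<float> sizes,
        NativeArray<PackedMatrix> objectToWorld,
        NativeArray<PackedMatrix> worldToObject,
        int objectToWorldByteOffset,
        int worldToObjectByteOffset)
    {
        int count = positions.Length;
        for (int i = 0; i < count; i++)
        {
            Vector3 p = positions[i];
            float s = sizes[i];
            float inv = 1f / s;

            objectToWorld[i] = PackedMatrix.TranslateScale(p, s);
            // Inverse of translate * scale: scale by 1/s, translate by -p/s.
            worldToObject[i] = PackedMatrix.TranslateScale(-p * inv, inv);
        }

        // Destination indices are in units of the element size (48 bytes),
        // so the byte offsets must be multiples of 48.
        instanceData.SetData(objectToWorld, 0, objectToWorldByteOffset / BytesPerPackedMatrix, count);
        instanceData.SetData(worldToObject, 0, worldToObjectByteOffset / BytesPerPackedMatrix, count);
    }
}
```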
Culling
When creating BatchRendererGroup, I specified a culling callback. This is a method that Unity invokes when the batch is in the camera frustum and ready to be rendered for the current frame.
Now I need to use this callback to cull the instances. Here I could do frustum culling, custom occlusion culling, or whatever else is needed to filter the instances to render.
And now the implementation is complete. Time to see the performance.
Batch Renderer Group: CPU performance
On the main thread, ~1.64 ms was added to compute all the matrices for the particles and send them to the GPU buffers.

Median CPU time: 5.43 ms
Particles update: 3.65 ms
FinishFrameRendering: 0.90 ms
I can also see my job for calculating draw commands executed on one worker thread:

And on the render thread, ~2 ms was added to send the matrix data from CPU to GPU.

Bottleneck: main thread and render thread
Batch Renderer Group: GPU performance
On the GPU, the render time is 3.93 ms

About 0.83 ms was spent copying the data, probably the matrices, then 3.10 ms to render the frame. So the rendering performance is very similar to procedural instancing.
However, while the performance is high, the number of draw calls is not 12, like in procedural instancing. The GPU invoked 1067 draw calls to render the batches. So internally, when Unity does culling and filtering, it reorganizes the instances.
My thoughts on BRG
Personally, I'm not a fan of this API. There is a lot of boilerplate code needed to make it work, and it is quite mind-bending to work with.
I'm more a fan of procedural instancing, where I do data updates, culling, and draw call invocation on my own.
Summary
I compared six ways to render 90 000 CPU-simulated particles in Unity 6 with URP, tested in Unity 6000.3.11f1.
The same particle simulation runs in all tests (~1.81 ms on its own). The rendering method makes the biggest difference.

This is the summary:
| Rendering method | Median CPU time | Particle update time | FinishFrameRendering + GPU Resident Drawer | Wait for GPU (median) | Median GPU time | Main bottleneck |
|---|---|---|---|---|---|---|
| GameObjects | 81.76 ms | 22.66 ms | 31.93 ms + 0 ms | 20.75 ms | 8.56 ms | Main thread (draw call prep + transform updates) |
| GameObjects + GPU Resident Drawer | 25.79 ms | 9.78 ms | 2.52 ms + 8.30 ms | 0 ms | 5.05 ms | Main thread (GPU Resident Drawer + transforms) |
| GameObjects + GPU Resident Drawer + Transform Access Array | 18.91 ms | 3.41 ms | 2.61 ms + 8.24 ms | 0 ms | 5.05 ms | Main thread (GPU Resident Drawer) |
| Graphics.RenderMesh | 92.53 ms | 28.38 ms | 51.77 ms + 0 ms | 0 ms | 31.34 ms | Main thread (270k immediate-mode draw calls) |
| Graphics.RenderMeshInstanced | 9.92 ms | 4.44 ms | 0.93 ms + 0 ms | 3.52 ms | 4.78 ms | Render thread (batch prep stalls CPU) |
| Graphics.RenderMeshPrimitives | 3.40 ms | 2.12 ms | 0.72 ms + 0 ms | 0.24 ms | 3.02 ms | GPU (small triangles; CPU/GPU balanced) |
| Batch Renderer Group | 5.43 ms | 3.65 ms | 0.90 ms + 0 ms | 0 ms | 3.93 ms | Main thread + render thread (matrix upload) |
Key takeaways for rendering a lot of instances of the same mesh:
Avoid one GameObject per instance unless you enable GPU Resident Drawer. Even then, transform updates remain costly unless you use Transform Access Array jobs. Also GPU Resident Drawer becomes costly on the Main Thread with many renderer instances present.
Graphics.RenderMesh per particle is the worst option, worse than GameObjects, because Unity creates and clears immediate-mode renderers every frame and is not able to minimize the GPU state switches between the draw calls.
Graphics.RenderMeshInstanced is much better but limited to 1023 instances per draw call and still stresses the render thread preparing batches. Also instanced rendering in this way underutilizes the worker threads.
Graphics.RenderMeshPrimitives is the fastest overall for this use case: one draw call, minimal CPU overhead (~0.26 ms to issue the draw), and only 12 GPU draw calls total. It requires a custom shader to read instance data from a structured buffer.
Batch Renderer Group performs similarly to procedural instancing on the GPU but with far more boilerplate and 1067 internal draw calls.
I personally prefer Graphics.RenderMeshPrimitives for direct control over data, culling, and draw submission. It is the fastest rendering method while also being very easy to implement.

