Rendering 90,000 particles in Unity: what actually costs CPU and GPU time

Jun 4, 2026

15 min

Guide to instancing in Unity 6.3


I render objects. A lot of objects.

What is the best way to render so many objects in Unity efficiently? Here I will show you the most common methods of rendering many objects in Unity 6 with URP.

I will explain the particle implementation and then render the same scene using these methods:

  1. GameObjects

  2. GameObjects with GPU Resident Drawer

  3. Graphics.RenderMesh

  4. Graphics.RenderMeshInstanced

  5. Graphics.RenderMeshPrimitives

  6. Batch Renderer Group

Let's see which rendering method is the fastest on the CPU and GPU, and which one is the easiest to implement.


___

Particle Simulation

Let's start by rendering something nice. I created particles that are simulated on the CPU. Each particle has position, velocity, and size.

struct ParticleData
{
	public float3 position;
	public float _pad0; // Padding to get 16-byte alignment
	public float3 velocity;
	public float size;
}


I prepared some parameters for the simulation, where I can control the particle behaviour:

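using Unity.Burst;
using Unity.Collections;
using Unity.Jobs;
using Unity.Mathematics;
using UnityEngine;
using Random = UnityEngine.Random; // Avoids the clash with Unity.Mathematics.Random

public partial class ParticleSimulation : MonoBehaviour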
{
	[SerializeField] int particleCount = 512;
	[SerializeField] float3 gravity = new float3(0f, -9.81f, 0f);
	[SerializeField] float noiseStrength = 0.35f;
	[SerializeField] float noiseFrequency = 1.25f;
	[SerializeField] float centerPullStrength = 0.6f;
	[SerializeField] float airDrag = 1.5f;
	[SerializeField] float groundBounce = 0.65f;
	[SerializeField] float defaultParticleSize = 0.04f;

	// Here I store the data for each particle.
	NativeArray<ParticleData> particles;

	void OnEnable()
	{
		// Initialize all the particles inside the unit sphere
		particles = new NativeArray<ParticleData>(particleCount, Allocator.Persistent);
		for (int i = 0; i < particleCount; i++)
		{
			particles[i] = new ParticleData
			{
				position = (float3)Random.insideUnitSphere,
				velocity = float3.zero,
				size = defaultParticleSize,
			};
		}
	}

	void OnDisable()
	{
		// Dispose the allocated data when the component is disabled
		if (particles.IsCreated)
			particles.Dispose();
	}
}


This is how all the particles look when initialized, rendered as sphere gizmos for each particle.


Then I added simulation with:

  1. Gravity

  2. Noise-based velocity modulation

  3. A force that pulls the particles towards the center

  4. Air drag

  5. Movement according to velocity

  6. Bouncing off the ground

I implemented that using a multithreaded job with the Burst compiler:

void Update()
{
	if (!particles.IsCreated)
		return;

	// Particle simulation is implemented in the Job system - running in parallel
	var job = new SimulateParticlesJob
	{
		particles = particles,
		deltaTime = Time.deltaTime,
		gravity = gravity,
		noiseStrength = noiseStrength,
		noiseFrequency = noiseFrequency,
		noiseTime = Time.timeSinceLevelLoad,
		centerPullStrength = centerPullStrength,
		airDrag = airDrag,
		groundBounce = groundBounce,
	};

	job.Schedule(particles.Length, 64).Complete();
}

[BurstCompile]
struct SimulateParticlesJob : IJobParallelFor
{
	public NativeArray<ParticleData> particles;
	public float deltaTime;
	public float3 gravity;
	public float noiseStrength;
	public float noiseFrequency;
	public float noiseTime;
	public float centerPullStrength;
	public float airDrag;
	public float groundBounce;

	public void Execute(int index)
	{
		ParticleData p = particles[index];

		// Apply gravity
		p.velocity += gravity * deltaTime;

		// Apply perlin noise to modulate the velocity
		float3 noiseOffset = new float3(noiseTime, noiseTime * 0.37f, noiseTime * 0.19f);
		float3 samplePos = p.position * noiseFrequency + noiseOffset;
		float3 noiseForce = new float3(
			noise.cnoise(samplePos),
			noise.cnoise(samplePos + 17.3f),
			noise.cnoise(samplePos + 41.7f));
		p.velocity += noiseForce * (noiseStrength * deltaTime);

		// Apply force towards the center
		float dist = math.length(p.position);
		if (dist > 1e-5f)
		{
			float3 toCenter = -p.position / dist;
			p.velocity += toCenter * (centerPullStrength * dist * deltaTime);
		}

		// Apply air drag
		float dragFactor = math.saturate(1f - airDrag * deltaTime);
		p.velocity *= dragFactor;

		// Move the position according to the velocity
		p.position += p.velocity * deltaTime;

		// Bounce the particle off the ground
		if (p.position.y < 0f)
		{
			p.position.y = -p.position.y;
			if (p.velocity.y < 0f)
				p.velocity.y = -p.velocity.y * groundBounce;
		}

		particles[index] = p;
	}
}


And this is how it looks for 5000 particles, rendered using Gizmos:


For 5000 particles, the simulation takes ~0.16 ms on an i5-10400F:


For 90000 particles, the median time is 1.81 ms.


___

Profiling

In the next section I will test different methods of rendering those 90000 particles. For each method I will measure the impact on CPU and GPU.

My profiling setup for every measurement is:

  • CPU: i5-10400F

  • GPU: RTX 3060 12 GB

  • RAM: DDR4 64 GB 1333 MHz

  • Resolution: 1600x900

  • Unity: 6000.3.11f1 (Unity 6), URP, Windows Build IL2CPP Master, debugging and deep profiling disabled, DX12


___

1. GameObjects

Let's start with the slowest rendering method. I will create a GameObject for each particle and animate its transform.

So I created this GameObject as a prefab:


And at the start of the simulation, I create one instance of this GameObject for each particle.

particleTransforms = new Transform[particles.Length];

for (int i = 0; i < particles.Length; i++)
{
	GameObject instance = Instantiate(particlePrefab, transform);
	instance.name = $"{particlePrefab.name}_{i}";
	particleTransforms[i] = instance.transform;
}


Then in each frame I modify the transform of each object to match the particle position.

for (int i = 0; i < particles.Length; i++)
{
	ParticleData p = particles[i];
	Transform particleTransform = particleTransforms[i];
	particleTransform.localPosition = p.position;
	particleTransform.localScale = Vector3.one * p.size;
}


GameObjects: CPU performance

For now I have the SRP Batcher enabled and GPU Resident Drawer disabled.

Notice how all the threads are busy all the time, preparing the batches. That leaves fewer threads available for the particle simulation, so the simulation itself takes longer.


Frame median for the main thread: 81.76 ms

  • ParticleSimulation.Update: 22.66 ms

  • UpdateAllRenderers: 4.43 ms

  • Waiting for render thread jobs: 20.75 ms

  • FinishFrameRendering: 31.93 ms


Bottleneck: main thread and render thread


GameObjects: GPU performance

The GPU looks much better than the CPU.
GPU frame time: 8.56 ms


On the GPU there is a clear bottleneck on communication between CPU and GPU. There is too much happening between draw calls.


Also, each particle is rendered with a separate draw call for each shadowmap cascade and color pass. The draw call count here is crazy: 270 010 draw calls per frame.
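As a quick sanity check of that number, the total can be split into draws per particle and a remainder for everything else in the frame. This is a standalone sketch, and the helper name is hypothetical:

```csharp
using System;

static class DrawCallEstimate
{
	// Splits a measured draw-call total into "draws per particle" and a
	// remainder that covers everything else in the frame.
	public static (int perParticle, int other) Split(int totalDrawCalls, int particleCount)
	{
		int perParticle = totalDrawCalls / particleCount;
		int other = totalDrawCalls % particleCount;
		return (perParticle, other);
	}
}
```

With the values above, Split(270010, 90000) gives 3 draws per particle and 10 other draw calls. That would match, for example, two shadow cascades plus the color pass, but the cascade count is an assumption here, as it is not stated in the measurements.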


___

2. GameObjects with GPU Resident Drawer

The SRP Batcher in Unity reduces the CPU cost of state changes between draw calls, but it can't do instanced rendering on the GPU. That means every renderer still creates its own draw call.

The GPU can render the same mesh many times in one draw call when using instanced rendering. That is useful when you need to render many objects that use the same mesh and material, which is the case here.

GPU Resident Drawer is a new rendering method in Unity that is focused on instanced rendering. It finds which objects in the scene use the same mesh and material, and then renders them using instanced rendering.

It greatly reduces the time needed for individual draw calls, but adds some time to figure out which objects can be batched. In this case it should be beneficial.

GPU Resident Drawer can be enabled in the Universal Render Pipeline Asset:


GPU Resident Drawer: CPU performance

So I just enabled GPU Resident Drawer. Below you can see that the main-thread frame time dropped greatly, from 81.76 ms to 24.88 ms.
There is much more free time on other threads.


Frame median for the main thread: 24.88 ms

  • ParticleSimulation.Update: 9.78 ms

  • UpdateAllRenderers: 3.77 ms

  • GPU Resident Drawer: 8.30 ms

  • FinishFrameRendering: 2.52 ms

Bottleneck: main thread

Interestingly, setting transform position for each object got 2-3x faster when GPU Resident Drawer was enabled. The only difference between the two sessions is that all other worker threads were free when GPU Resident Drawer was enabled. Maybe in the first case, other threads were locking the hierarchy data for some reason.

Now, one of the slowest parts of the rendering is setting the transforms for the particles. One optimization I can do in this case is to use a TransformAccessArray together with an IJobParallelForTransform job, which can be Burst-compiled and sets transform values much faster. You can find more about that in the documentation, where there is a nice example of how to use it to speed up the transform operations:
https://docs.unity3d.com/6000.4/Documentation/ScriptReference/Jobs.IJobParallelForTransform.html
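The exact implementation is not shown here. A minimal sketch of the pattern could look like the code below, assuming the ParticleData struct from the Particle Simulation section and the particleTransforms array created earlier. The names ApplyParticleTransformsJob and ParticleTransformSync are hypothetical.

```csharp
using Unity.Burst;
using Unity.Collections;
using Unity.Jobs;
using UnityEngine;
using UnityEngine.Jobs;

// Hypothetical job: copies particle position and size into the transforms.
// ParticleData is the struct defined in the Particle Simulation section.
[BurstCompile]
struct ApplyParticleTransformsJob : IJobParallelForTransform
{
	[ReadOnly] public NativeArray<ParticleData> particles;

	public void Execute(int index, TransformAccess transform)
	{
		ParticleData p = particles[index];
		transform.localPosition = p.position;
		transform.localScale = new Vector3(p.size, p.size, p.size);
	}
}

// Hypothetical helper that owns the TransformAccessArray.
public class ParticleTransformSync : MonoBehaviour
{
	TransformAccessArray transformAccessArray;

	// Call once, after the particle GameObjects have been instantiated.
	public void Initialize(Transform[] particleTransforms)
	{
		transformAccessArray = new TransformAccessArray(particleTransforms);
	}

	// Call every frame, after the simulation job has completed.
	public void Apply(NativeArray<ParticleData> particles)
	{
		new ApplyParticleTransformsJob { particles = particles }
			.Schedule(transformAccessArray)
			.Complete();
	}

	void OnDestroy()
	{
		// TransformAccessArray is a native container and must be disposed.
		if (transformAccessArray.isCreated)
			transformAccessArray.Dispose();
	}
}
```

One thing to keep in mind: as far as I know, transforms that share the same root are processed on the same worker thread, so with all particles parented under one object most of the gain likely comes from Burst and from skipping the per-call managed overhead rather than from parallelism.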

I implemented that for the particles, and these are the results:


Frame median for the main thread: 18.91 ms

  • ParticleSimulation.Update: 3.41 ms

  • UpdateAllRenderers: 4.13 ms

  • GPU Resident Drawer: 8.24 ms

  • FinishFrameRendering: 2.61 ms

Using Transform Access Array saved 6.37 ms on particle update time compared to GPU Resident Drawer without it (9.78 ms → 3.41 ms). Overall main thread time dropped from 24.88 ms to 18.91 ms, saving 5.97 ms.

Bottleneck: main thread


GPU Resident Drawer: GPU performance

Now let's find out what the rendering performance of GPU Resident Drawer on the GPU is.


GPU frame time: 5.05 ms, where:

  • 1.96 ms - sending instance data to the GPU

  • 3.10 ms - render time

Compared to plain GameObjects without GPU Resident Drawer, GPU frame time dropped from 8.56 ms to 5.05 ms, saving 3.51 ms.


Quick note: GPU Resident Drawer is not faster in every case!
In my article about occlusion culling it was actually slower.


___

3. Graphics.RenderMesh

Now let's see what happens if I don't use GameObjects at all, but invoke all the draws manually from C# code.

I will iterate through each particle and use Graphics.RenderMesh to invoke the draw call for each instance. Behind the scenes it works like creating an immediate-mode renderer in the scene with a one-frame lifetime.

So, here is the code I added to the Update method:

void Update()
{
	...

	// Uses burst compiler and multithreaded job
	new BuildParticleMatricesJob
	{
		particles = particles,
		matrices = instanceMatrices,
	}.Schedule(particles.Length, 64).Complete();

	RenderParams renderParams = CreateRenderParams(material);
	for (int i = 0; i < particles.Length; i++)
		Graphics.RenderMesh(renderParams, mesh, SubmeshIndex, instanceMatrices[i]);
}

...

[BurstCompile]
struct BuildParticleMatricesJob : IJobParallelFor
{
	[ReadOnly] public NativeArray<ParticleData> particles;
	public NativeArray<Matrix4x4> matrices;
	public void Execute(int index)
	{
		ParticleData p = particles[index];
		matrices[index] = new Matrix4x4(
			new Vector4(p.size, 0f, 0f, 0f),
			new Vector4(0f, p.size, 0f, 0f),
			new Vector4(0f, 0f, p.size, 0f),
			new Vector4(p.position.x, p.position.y, p.position.z, 1f));
	}
}


Graphics.RenderMesh: CPU performance

Well... it doesn't look good. This way of rendering creates a huge problem for the main thread and render thread at the same time.


Frame median for the main thread: 92.53 ms

  • ParticleSimulation.Update: 28.38 ms

  • FinishFrameRendering: 51.77 ms

  • Clear immediate renderers: 10.94 ms

Bottleneck: main thread


Graphics.RenderMesh: GPU performance

GPU performance also doesn't look good. Because each draw call is separate and invoked from code, Unity can't batch anything.
The bottleneck is on CPU-GPU communication, with very low throughput.

There are too many state switches between the 270 010 draw calls.

The measured GPU frame time was 31.34 ms.


___

4. Graphics.RenderMeshInstanced

Now let's look at instancing. There are a few ways to use instancing in Unity. For now I will try Graphics.RenderMeshInstanced. This method allows me to render one mesh multiple times by providing an array of local-to-world matrices.

The limitation of this method is that it passes the matrices to the shader as a uniform array, and uniform arrays have a limited maximum size. It accepts at most 1023 matrices per call. That means for 90,000 particles I still need 88 calls: 87 with 1023 particles each and one with the remaining 999.
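The batch count follows from a ceiling division. Here is a small standalone sketch of that arithmetic. The class and method names are made up for illustration, and only the 1023 limit comes from the text above:

```csharp
using System;

static class InstancedBatching
{
	// Maximum number of instances accepted by one instanced call.
	public const int MaxInstancesPerCall = 1023;

	// Number of calls needed to cover instanceCount instances (ceiling division).
	public static int CallCount(int instanceCount) =>
		(instanceCount + MaxInstancesPerCall - 1) / MaxInstancesPerCall;

	// Size of the final, possibly partial, batch.
	public static int LastBatchSize(int instanceCount)
	{
		if (instanceCount <= 0)
			return 0;
		int remainder = instanceCount % MaxInstancesPerCall;
		return remainder == 0 ? MaxInstancesPerCall : remainder;
	}
}
```

For 90,000 particles, CallCount returns 88 and LastBatchSize returns 999, since 87 × 1023 = 89,001.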

The way it works is that Unity creates an immediate-mode renderer for each draw issued.

The code looks like this:

const int InstancedBatchSize = 1023;

void Update()
{
	...

	// Use multithreaded job to build all matrices
	new BuildParticleMatricesJob
	{
		particles = particles,
		matrices = instanceMatrices,
	}.Schedule(particles.Length, 64).Complete();

	// Prepare rendering params. This struct specifies parameters like GameObject layers, bounding box, motion vector render mode, shadow casting settings, etc.
	RenderParams renderParams = CreateRenderParams(material);

	// Render as long as there are still some particles to render
	int particleIndex = 0;
	while (particleIndex < particles.Length)
	{
		// Each batch holds at most 1023 particles; the last one holds the remainder
		int batchCount = math.min(InstancedBatchSize, particles.Length - particleIndex);
		Graphics.RenderMeshInstanced(renderParams, mesh, SubmeshIndex, instanceMatrices, batchCount, particleIndex);

		// Increase particle index by the batch count
		particleIndex += batchCount;
	}
}

// Job that builds the matrix array
[BurstCompile]
struct BuildParticleMatricesJob : IJobParallelFor
{
	[ReadOnly] public NativeArray<ParticleData> particles;
	public NativeArray<Matrix4x4> matrices;
	public void Execute(int index)
	{
		ParticleData p = particles[index];
		matrices[index] = new Matrix4x4(
			new Vector4(p.size, 0f, 0f, 0f),
			new Vector4(0f, p.size, 0f, 0f),
			new Vector4(0f, 0f, p.size, 0f),
			new Vector4(p.position.x, p.position.y, p.position.z, 1f));
	}
}


Note about further optimization

The above code could be optimized further. I could:

  1. Simulate the particles in Update(),

  2. Schedule this job without calling Complete(), but call JobHandle.ScheduleBatchedJobs() to ensure the job starts immediately without blocking the main thread,

  3. In LateUpdate() call .Complete() on the job handle and render the particles using the Graphics API.

In this way, the matrix calculation will happen between Update() and LateUpdate() on another thread. So it is free as long as worker threads are free.
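The steps above can be sketched roughly as follows. This is not the code used for the measurements: DeferredMatrixBuild and BuildMatricesJob are hypothetical stand-ins, the instance count is fixed to one batch for brevity, and the material is assumed to have GPU instancing enabled.

```csharp
using Unity.Burst;
using Unity.Collections;
using Unity.Jobs;
using UnityEngine;

// Hypothetical stand-in for BuildParticleMatricesJob: one translation matrix per instance.
[BurstCompile]
struct BuildMatricesJob : IJobParallelFor
{
	[ReadOnly] public NativeArray<Vector3> positions;
	public NativeArray<Matrix4x4> matrices;

	public void Execute(int index)
	{
		Vector3 p = positions[index];
		matrices[index] = new Matrix4x4(
			new Vector4(1f, 0f, 0f, 0f),
			new Vector4(0f, 1f, 0f, 0f),
			new Vector4(0f, 0f, 1f, 0f),
			new Vector4(p.x, p.y, p.z, 1f));
	}
}

public class DeferredMatrixBuild : MonoBehaviour
{
	const int InstanceCount = 1023; // One batch only, to keep the sketch short

	[SerializeField] Mesh mesh;
	[SerializeField] Material material; // Assumed to have GPU instancing enabled

	NativeArray<Vector3> positions;
	NativeArray<Matrix4x4> matrices;
	JobHandle buildHandle;

	void OnEnable()
	{
		positions = new NativeArray<Vector3>(InstanceCount, Allocator.Persistent);
		matrices = new NativeArray<Matrix4x4>(InstanceCount, Allocator.Persistent);
	}

	void Update()
	{
		// Schedule only; the main thread is not blocked here.
		buildHandle = new BuildMatricesJob { positions = positions, matrices = matrices }
			.Schedule(InstanceCount, 64);

		// Ask the job system to start the scheduled work right away.
		JobHandle.ScheduleBatchedJobs();
	}

	void LateUpdate()
	{
		// Wait only now, after other Update() work had a chance to overlap with the job.
		buildHandle.Complete();
		Graphics.RenderMeshInstanced(new RenderParams(material), mesh, 0, matrices);
	}

	void OnDisable()
	{
		// Make sure no job still uses the arrays before disposing them.
		buildHandle.Complete();
		if (positions.IsCreated) positions.Dispose();
		if (matrices.IsCreated) matrices.Dispose();
	}
}
```

The important part is that Complete() moves from Update() to LateUpdate(); nothing else about the job or the draw call has to change.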

Anyway, I don't have any gameplay logic here, so this optimization would not change the measurements much.


Adjusting the shader

To make the shader support rendering 1023 instances at a time, according to the documentation, I need to add #pragma instancing_options assumeuniformscaling to the shader. Otherwise it will be able to render just 511 objects at a time, as it will need to use inverse-model matrices for normal transformations.

I created a custom Shader Graph with a custom function node with this pragma and used it in rendering. Now this shader should be able to render 1023 instances in one draw call.


Let's check the rendering performance.


Graphics.RenderMeshInstanced: CPU performance

Most of the CPU time on the main thread is spent preparing the batches (3 ms) and waiting for the GPU to finish rendering.


However, the render thread becomes the problem here, because preparing those batches becomes quite costly for the render thread. Here you can see how much time it spent filling in the data and invoking the draw calls:


Notice how shadow rendering takes only 2 threads and opaque rendering only one thread. Unity wasn't able to prepare the batches using more than 3 threads at a time.
It also makes some of the threads unavailable for particle simulation and matrix calculation, so the particle update slows down as well.


Median CPU time was 9.92 ms:

  • ParticleSimulation.Update: 4.44 ms

  • FinishFrameRendering: 0.93 ms

  • Wait for last present: 3.52 ms (actually caused by slow render thread rather than slow rendering on the GPU)

Bottleneck: render thread


Graphics.RenderMeshInstanced: GPU performance

The GPU looks a bit better. I measured a GPU frame time of 4.78 ms.


However, the slow render thread caused the GPU to wait between the frames. Notice the empty spaces between the frames. This is the time where the GPU was waiting for the CPU to submit the command lists.


The bottleneck on the GPU side is PCIe communication with the CPU.


___

5. Graphics.RenderMeshPrimitives

Now let's look at the instancing method that is not restricted by the max object count: Graphics.RenderMeshPrimitives.

This method allows me to render any number of instances in one draw call. There is one drawback: I need to figure out how to feed the transformation data for each instance on my own, because URP shaders don't support it out of the box.

I store all the particles in a NativeArray already. I could just send this data directly to the GPU. Then I could use it directly in the shader to move the particles.

NativeArray<ParticleData> particles; // I could just send this data to the GPU and use it as is.


C# implementation

At the start of the particle simulation, I will allocate a graphics buffer for the particle data. A GraphicsBuffer represents an "array" of data that lives on the GPU.

// Keep the reference to the graphics buffer
GraphicsBuffer particleBuffer;

// Material property block will be used to bind the buffer to the material
MaterialPropertyBlock proceduralMaterialProperties;

void OnEnable()
{
	...
	// Allocate graphics buffer with particles
	particleBuffer = new GraphicsBuffer(GraphicsBuffer.Target.Structured, particleCount, UnsafeUtility.SizeOf<ParticleData>());
	proceduralMaterialProperties = new MaterialPropertyBlock();
}


In each frame, after the simulation is finished, I will sync the state of the particles with this GPU buffer. Then I will just render the particles using Graphics.RenderMeshPrimitives. Unity will take care of invoking the draw calls for the shadow maps and opaques.

void Update()
{
	...
	// Send all the particles data to the GPU
	using (Markers.SyncParticleBuffer.Auto())
		particleBuffer.SetData(particles);

	// Bind the GPU buffer with particles to the shader
	proceduralMaterialProperties.SetBuffer(Uniforms._Particles, particleBuffer);

	// Render all the particles in one draw call
	RenderParams renderParams = CreateRenderParams(proceduralMaterial);
	renderParams.matProps = proceduralMaterialProperties;
	Graphics.RenderMeshPrimitives(renderParams, mesh, SubmeshIndex, particles.Length);
}


Shader for procedural instancing

Now the data should be accessible in the shader code under StructuredBuffer<ParticleData> _Particles, so it is time to create a shader for it.

I decided to use Shader Graph for it. My goal is to modify the position for each particle.

I need the InstanceID to get the index of the rendered particle.
I will use a custom function node to implement the position offset for each particle.


When procedural instancing is used, the local-to-world matrix is set to identity. So object space is the same as the world space.

The custom node I created uses the HLSL code that implements the instancing.


And this is the HLSL implementation. Explanations are in the code's comments:

#ifndef PARTICLE_PROCEDURAL_INSTANCING_INCLUDED
#define PARTICLE_PROCEDURAL_INSTANCING_INCLUDED

#include "Packages/com.unity.render-pipelines.universal/ShaderLibrary/Core.hlsl"

// Must match ParticleData in ParticleSimulation.cs - with 16 byte alignment.
// 16 byte alignment is important for the performance on some GPUs
// (I read this somewhere on the Internet and never tested whether it's true - an idea for an article, I guess)
struct ParticleData
{
	float3 position;
	float  _pad0;
	float3 velocity;
	float  size;
};

// This is the graphics buffer that stores the particles - I set this in the C# code.
StructuredBuffer<ParticleData> _Particles;

void ParticleProceduralInstancing_float(in uint IN_InstanceID, in float3 IN_PositionWS, out float3 OUT_PositionWS)
{
	// I invoked the draw call specifying the number of instances to render.
	// Here I can use the InstanceID to get the data of a single particle
	ParticleData particleData = _Particles[IN_InstanceID];

	// Now, I can just use the particle position and size to scale and offset the vertex position.
	OUT_PositionWS = IN_PositionWS * particleData.size + particleData.position;
}

#endif

And that's it! Time to create a material that uses this shader and use it to render the particles.


Graphics.RenderMeshPrimitives: CPU performance

This is the most efficient method of rendering. Notice that the highest overhead of this method is on sending the particle data from CPU to GPU.
This rendering method needs to copy the particle data for the render thread. Then the render thread sends the data to the GPU.

The median time for copying the data was 0.263 ms.
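For scale, here is a rough estimate of how much data that copy moves each frame. It assumes ParticleData takes 32 bytes (two float3 values plus two floats, as declared earlier). The helper names are hypothetical.

```csharp
using System;

static class UploadEstimate
{
	// Assumed ParticleData size: float3 + float + float3 + float = 8 floats = 32 bytes.
	public const int BytesPerParticle = 8 * sizeof(float);

	// Bytes copied per frame for the whole particle array.
	public static long BytesPerFrame(int particleCount) =>
		(long)particleCount * BytesPerParticle;

	// Effective copy rate in GB/s (10^9 bytes per second) for a copy that took the given time.
	public static double GigabytesPerSecond(long bytes, double milliseconds) =>
		bytes / (milliseconds / 1000.0) / 1e9;
}
```

For 90,000 particles this gives 2,880,000 bytes (about 2.75 MiB) per frame. At the measured 0.263 ms, that works out to roughly 11 GB/s for the CPU-side copy.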


On the render thread, the data was sent to the GPU using BufferD3D12.UpdateInternal:


And it matches the GPU time reported by NVIDIA Nsight (it is not the same frame).


Median frame time for the CPU was 3.40 ms:

  • Particle update: 2.12 ms (simulation: 1.84 ms, invoking the draw call: 0.26 ms)

  • FinishFrameRendering: 0.72 ms

  • Wait for present: 0.24 ms

Bottleneck: GPU, but CPU and GPU times were nearly equal in this scenario.


Graphics.RenderMeshPrimitives: GPU performance

GPU frame time was ~3.02 ms, with ~0.25 ms of waiting between the frames.


The bottleneck is not PCIe this time. All throughput metrics are low, which suggests that the particles are too small on screen, with triangles too small to render efficiently. To optimize the GPU side further, I would need to change the particle mesh (for example, render quad impostors instead of full cube meshes). I could also play with the mesh format and try to render many particles within a single mesh to see if the workload distribution on this GPU gets better.


And I can see that only 12 draw calls were issued for the whole frame:


Also in NVIDIA Nsight I can see that the async queue is used to send the particle data to the GPU and it happens during the rendering.
This data copy process increases the VRAM usage from 1% to 4%.


___

6. Batch Renderer Group

And finally, there is one more method of instanced rendering in Unity: Batch Renderer Group.

Batch Renderer Group was created for the DOTS tech stack to work with ECS. However, it is possible to use it outside of ECS. The good thing about BRG is that it supports rendering with URP shaders.

The concept is very similar to procedural instancing: send the data to the GPU once and render all instances at once, reading properties of each instance from a GPU buffer. But Batch Renderer Group takes a completely different path for the implementation.

You do not issue a draw call from Update(). Instead you:

  1. Register a batch with Unity,

  2. Keep a GPU buffer of per-instance transforms up to date,

  3. Answer a culling callback when the engine is ready to render, and invoke the render command for each instance from a Burst-compatible multithreaded job.

All of that allows you to:

  • cull the draws using a multithreaded Burst-compiled job,

  • invoke draw calls with different meshes and different materials, as long as they use the same per-instance data.

But this flexibility comes with the cost of massive boilerplate code. Let me show you how it works.


BRG - implementation in short

Initializing the BRG

At setup, I created a BatchRendererGroup, registered my mesh and material, and allocated a raw GraphicsBuffer that holds one transform pair per particle (unity_ObjectToWorld and unity_WorldToObject). The snippets below run once, to initialize the BRG.

// Create the BRG
batchRendererGroup = new BatchRendererGroup(OnPerformCulling, IntPtr.Zero); // THIS ALREADY REGISTERS THE GROUP FOR RENDERING WITH THE CULLING CALLBACK!

// Register mesh/material handles used later in draw commands.
brgMeshId = batchRendererGroup.RegisterMesh(mesh);
brgMaterialId = batchRendererGroup.RegisterMaterial(material);

// Calculate and store the instance count
brgInstanceCount = particles.Length;

// CPU-side staging arrays - packed into the GPU buffer each frame.
brgObjectToWorld = new NativeArray<PackedMatrix>(brgInstanceCount, Allocator.Persistent);
brgWorldToObject = new NativeArray<PackedMatrix>(brgInstanceCount, Allocator.Persistent);

// Raw buffer holds all instance matrices
// First half of the buffer is local-to-world matrices, then second half is world-to-local matrices.
// Also this is a raw buffer. So I need to manually calculate the amount of bytes to allocate for all those matrices.
var bufferTarget = GraphicsBuffer.Target.Raw;
int bufferCount = BufferCountForInstances(BrgBytesPerInstance, brgInstanceCount, BrgExtraBytes);
brgInstanceData = new GraphicsBuffer(bufferTarget, bufferCount, sizeof(int));

// Size the raw buffer in 4-byte chunks (GraphicsBuffer stride is sizeof(int) here).
static int BufferCountForInstances(int bytesPerInstance, int numInstances, int extraBytes = 0)
{
	bytesPerInstance = (bytesPerInstance + sizeof(int) - 1) / sizeof(int) * sizeof(int);
	extraBytes = (extraBytes + sizeof(int) - 1) / sizeof(int) * sizeof(int);
	int totalBytes = bytesPerInstance * numInstances + extraBytes;
	return totalBytes / sizeof(int);
}


BRG expects transforms in a packed 3×4 layout (12 floats per matrix, not a full 4×4). The buffer layout is: padding, then the objectToWorld array, then the worldToObject array. I need to calculate the byte offset of each section in this buffer:

// Layout of the instance data buffer: [padding][objectToWorld array][worldToObject array]
brgByteAddressObjectToWorld = (uint)BrgSizeOfPackedMatrix * 2;
brgByteAddressWorldToObject = brgByteAddressObjectToWorld + (uint)BrgSizeOfPackedMatrix * (uint)brgInstanceCount;
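To make the layout concrete, here is a standalone sketch of the same offsets for 90,000 instances. It assumes a packed matrix is 12 floats (48 bytes) and that the only extra bytes are the two-matrix padding at the start. The real values of BrgBytesPerInstance and BrgExtraBytes are not shown above, so treat these constants as assumptions.

```csharp
using System;

static class BrgLayout
{
	// 3x4 packed matrix: 12 floats = 48 bytes.
	public const int SizeOfPackedMatrix = 12 * sizeof(float);

	// Padding at the start of the buffer, matching the "* 2" in the snippet above.
	public const int PaddingBytes = SizeOfPackedMatrix * 2;

	// Byte address of the first object-to-world matrix.
	public static uint ObjectToWorldAddress() => (uint)PaddingBytes;

	// Byte address of the first world-to-object matrix.
	public static uint WorldToObjectAddress(int instanceCount) =>
		(uint)(PaddingBytes + SizeOfPackedMatrix * instanceCount);

	// Total buffer size in bytes: padding plus both matrix arrays.
	public static int TotalBytes(int instanceCount) =>
		PaddingBytes + 2 * SizeOfPackedMatrix * instanceCount;

	// Element count for a raw buffer with a 4-byte stride.
	public static int IntCount(int instanceCount) =>
		TotalBytes(instanceCount) / sizeof(int);
}
```

Under these assumptions, the object-to-world array starts at byte 96, the world-to-object array at byte 4,320,096, and the whole buffer is 8,640,096 bytes, which is 2,160,024 four-byte elements.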


The instance data can be packed in many ways, so the URP shaders need to know where to find those matrices. I need to tell Unity how to read the data in the shader. This is what metadata is for:

// Tell Unity where unity_ObjectToWorld / unity_WorldToObject live in the buffer
// so the shader can index them per instance (BrgInstanceArrayFlag marks them as arrays).
var metadata = new NativeArray<MetadataValue>(2, Allocator.Temp);
metadata[0] = new MetadataValue
{
	NameID = Shader.PropertyToID("unity_ObjectToWorld"),
	Value = BrgInstanceArrayFlag | brgByteAddressObjectToWorld, // Array of object-to-world matrices
};
metadata[1] = new MetadataValue
{
	NameID = Shader.PropertyToID("unity_WorldToObject"),
	Value = BrgInstanceArrayFlag | brgByteAddressWorldToObject, // Array of world-to-object matrices
};


Now I can add one instanced batch to the BRG. It tells Unity: "Here is one batch of instances - this is how to read their data, and this is the GPU buffer that holds it."

brgBatchId = batchRendererGroup.AddBatch(metadata, brgInstanceData.bufferHandle, (uint)brgBufferOffset, (uint)brgBufferWindowSize);


Now the batch of instances is properly registered by the engine. This was just the initialization. Now it is time to render!


Updating the instances

Now my responsibility is to keep the instance data up to date. I will do this in the Update method.

void Update()
{
	...

	// Build object-to-world / world-to-object matrices from particle position and size.
	// Here I build native arrays with those matrices using multithreaded bursted job
	new PackBrgInstanceMatricesJob
	{
		particles = particles,
		objectToWorld = brgObjectToWorld,
		worldToObject = brgWorldToObject,
	}.Schedule(particles.Length, 64).Complete();

	// I upload both matrix arrays into the single raw GPU buffer at their byte offsets.
	int objectToWorldOffset = (int)(brgByteAddressObjectToWorld / BrgSizeOfPackedMatrix);
	int worldToObjectOffset = (int)(brgByteAddressWorldToObject / BrgSizeOfPackedMatrix);
	brgInstanceData.SetData(brgObjectToWorld, 0, objectToWorldOffset, brgInstanceCount);
	brgInstanceData.SetData(brgWorldToObject, 0, worldToObjectOffset, brgInstanceCount);

	// Coarse bounds for the whole batch - Unity culls the group before OnPerformCulling runs - it is good to keep it up to date.
	batchRendererGroup.SetGlobalBounds(new Bounds(transform.position, BrgGlobalBoundsSize));
}
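The body of PackBrgInstanceMatricesJob is not shown above. As a rough sketch of the math it needs: for a particle with uniform scale and translation only, the inverse matrix can be written directly, with no general matrix inversion. The PackedMatrix layout below (four columns of three floats) is an assumption for illustration and may differ from the one actually used.

```csharp
using System;

// Hypothetical 3x4 packed matrix: four columns (c0..c3) with three floats each.
struct PackedMatrix
{
	public float c0x, c0y, c0z;
	public float c1x, c1y, c1z;
	public float c2x, c2y, c2z;
	public float c3x, c3y, c3z;

	// Applies the affine transform to a point.
	public (float x, float y, float z) TransformPoint(float x, float y, float z) =>
		(c0x * x + c1x * y + c2x * z + c3x,
		 c0y * x + c1y * y + c2y * z + c3y,
		 c0z * x + c1z * y + c2z * z + c3z);
}

static class ParticleMatrices
{
	// Object-to-world for uniform scale `size` and translation (px, py, pz).
	public static PackedMatrix ObjectToWorld(float px, float py, float pz, float size) =>
		new PackedMatrix
		{
			c0x = size, c1y = size, c2z = size,
			c3x = px, c3y = py, c3z = pz,
		};

	// Inverse of the above: scale by 1/size and translate by -position/size.
	// size must be non-zero, otherwise the result contains infinities.
	public static PackedMatrix WorldToObject(float px, float py, float pz, float size)
	{
		float inv = 1f / size;
		return new PackedMatrix
		{
			c0x = inv, c1y = inv, c2z = inv,
			c3x = -px * inv, c3y = -py * inv, c3z = -pz * inv,
		};
	}
}
```

Transforming a point with ObjectToWorld and then with WorldToObject should return the original point, up to floating-point error.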


Culling

When creating BatchRendererGroup, I specified a culling callback. This is a method that Unity invokes when the batch is in the camera frustum and ready to be rendered for the current frame.

// This was during the initialization

// Create the BRG
batchRendererGroup = new BatchRendererGroup(OnPerformCulling, IntPtr.Zero); // THIS ALREADY REGISTERS THE GROUP FOR RENDERING WITH THE CULLING CALLBACK!


Now I need to use this callback to cull the instances. Here I could do frustum culling, custom occlusion culling, or whatever else is needed to filter the instances to render.

// Unity calls this during rendering to ask which instances should be drawn.
// We fill cullingOutput with draw commands instead of calling Graphics.Render* ourselves.
public unsafe JobHandle OnPerformCulling(
	BatchRendererGroup rendererGroup,
	BatchCullingContext cullingContext,
	BatchCullingOutput cullingOutput,
	IntPtr userContext)
{
	if (batchRendererGroup == null || brgInstanceCount == 0)
		return default;

	// Here I will only output the instances into the output
	using (Markers.OnPerformCulling.Auto())
		return OnPerformCullingInternal(cullingOutput);
}

// No per-instance frustum test here - emit one draw command listing every instance index.
unsafe JobHandle OnPerformCullingInternal(BatchCullingOutput cullingOutput)
{
	int alignment = UnsafeUtility.AlignOf<long>();
	var drawCommands = (BatchCullingOutputDrawCommands*)cullingOutput.drawCommands.GetUnsafePtr();

	// Allocate the output structures Unity expects (freed by the engine after the job completes).
	drawCommands->drawCommands = (BatchDrawCommand*)UnsafeUtility.Malloc(
		UnsafeUtility.SizeOf<BatchDrawCommand>(), alignment, Allocator.TempJob);
	drawCommands->drawRanges = (BatchDrawRange*)UnsafeUtility.Malloc(
		UnsafeUtility.SizeOf<BatchDrawRange>(), alignment, Allocator.TempJob);
	drawCommands->visibleInstances = (int*)UnsafeUtility.Malloc(
		brgInstanceCount * sizeof(int), alignment, Allocator.TempJob);
	drawCommands->drawCommandPickingEntityIds = null;
	drawCommands->instanceSortingPositions = null;
	drawCommands->instanceSortingPositionFloatCount = 0;

	// Output the instance count using a burst-compiled job
	return new BrgOutputDrawCommandsJob
	{
		instanceCount = brgInstanceCount,
		drawCommands = drawCommands,
		batchId = brgBatchId,
		materialId = brgMaterialId,
		meshId = brgMeshId,
		submeshIndex = (ushort)SubmeshIndex,
		layer = (byte)gameObject.layer,
	}.Schedule();
}

// Fill BatchCullingOutput with one instanced draw (mesh + material + visible instance indices).
[BurstCompile]
unsafe struct BrgOutputDrawCommandsJob : IJob
{
	public int instanceCount;
	[NativeDisableUnsafePtrRestriction] public BatchCullingOutputDrawCommands* drawCommands;
	public BatchID batchId;
	public BatchMaterialID materialId;
	public BatchMeshID meshId;
	public ushort submeshIndex;
	public byte layer;

	public void Execute()
	{
		// Set the amount of draw commands and unique draw settings
		drawCommands->drawCommandCount = 1;
		drawCommands->drawRangeCount = 1;

		// Set the number of visible instances
		drawCommands->visibleInstanceCount = instanceCount;

		// Invoke one draw command with all instances, selected mesh and material etc.
		drawCommands->drawCommands[0] = new BatchDrawCommand
		{
			visibleOffset = 0,
			visibleCount = (uint)instanceCount,
			batchID = batchId,
			materialID = materialId,
			meshID = meshId,
			submeshIndex = submeshIndex,
			splitVisibilityMask = 0xff,
			flags = BatchDrawCommandFlags.None,
			sortingPosition = 0,
		};

		// Filter settings mirror what RenderParams would set (layer, shadows, motion vectors).
		drawCommands->drawRanges[0] = new BatchDrawRange
		{
			// Apply these filter settings to the first draw command
			drawCommandsBegin = 0,
			drawCommandsCount = 1,
			filterSettings = new BatchFilterSettings
			{
				renderingLayerMask = 0xffffffff,
				layer = layer,
				motionMode = MotionVectorGenerationMode.Object,
				shadowCastingMode = ShadowCastingMode.On,
				receiveShadows = true,
				staticShadowCaster = false,
				allDepthSorted = false,
			},
			drawCommandsType = BatchDrawCommandType.Direct,
		};

		// Which instances to render: all of them, in index order
		for (int i = 0; i < instanceCount; i++)
			drawCommands->visibleInstances[i] = i;
	}
}


And now the implementation is complete. Time to see the performance.


Batch Renderer Group: CPU performance

On the main thread, ~1.64 ms was added to compute all the matrices for the particles and send them to the GPU buffers.


Median CPU time: 5.43 ms

  • Particles update: 3.65 ms

  • FinishFrameRendering: 0.90 ms

I can also see my job for calculating draw commands executed on one worker thread:


And on the render thread, ~2 ms was added to send the matrix data from CPU to GPU.
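To get a feel for how much data that upload involves, here is a rough, standalone estimate. It is only a sketch under stated assumptions: it assumes each instance matrix is stored in the packed 3x4 form that BRG shaders read (a 4x4 matrix with the constant last row dropped, 12 floats, 48 bytes), and that particles only need translation and uniform scale. `PackedMatrixLayout`, `Write`, and `UploadBytes` are hypothetical names for illustration, not part of Unity or of the implementation above.

```csharp
using System;

static class PackedMatrixLayout
{
	// Assumption: a 4x4 matrix without its last row (0, 0, 0, 1),
	// stored as four columns of three floats each.
	public const int FloatsPerPackedMatrix = 12;

	// Writes a translation + uniform scale matrix for one instance
	// as columns c0, c1, c2 (scaled axes) and c3 (translation).
	public static void Write(float[] dst, int instance, float px, float py, float pz, float size)
	{
		int o = instance * FloatsPerPackedMatrix;
		dst[o + 0] = size; dst[o + 1] = 0f;   dst[o + 2] = 0f;   // c0
		dst[o + 3] = 0f;   dst[o + 4] = size; dst[o + 5] = 0f;   // c1
		dst[o + 6] = 0f;   dst[o + 7] = 0f;   dst[o + 8] = size; // c2
		dst[o + 9] = px;   dst[o + 10] = py;  dst[o + 11] = pz;  // c3
	}

	// Bytes uploaded per frame for a given number of matrices per instance
	// (1 = object-to-world only, 2 = object-to-world plus its inverse).
	public static long UploadBytes(int instanceCount, int matricesPerInstance)
	{
		return (long)instanceCount * matricesPerInstance * FloatsPerPackedMatrix * sizeof(float);
	}
}
```

Under these assumptions, `PackedMatrixLayout.UploadBytes(90000, 1)` gives 4,320,000 bytes per frame, and twice that if an inverse matrix is uploaded as well. That is several megabytes moving from CPU to GPU every frame, which is consistent with the upload showing up as a visible cost on the render thread.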


Bottleneck: main thread and render thread


Batch Renderer Group: GPU performance

On the GPU, the render time is 3.93 ms


About 0.83 ms was spent copying the data, probably the matrices, then 3.10 ms to render the frame. So the rendering performance is very similar to procedural instancing.

However, while the performance is high, the number of draw calls is not 12, as it was with procedural instancing. The GPU invoked 1067 draw calls to render the batches. So it seems that internally, when Unity does culling and filtering, it reorganizes the instances into many smaller draws.


My thoughts on BRG

Personally, I'm not a fan of this API. There is a lot of boilerplate code needed to make it work, and it is quite mind-bending to work with.

I'm more a fan of procedural instancing, where I do data updates, culling, and draw call invocation on my own.
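To illustrate what "culling on my own" can look like, here is a minimal CPU-side sketch of a sphere-versus-frustum-plane test that compacts visible particle indices. It is not the code used in the measurements above (those render every instance without per-particle culling). To keep it standalone it uses `System.Numerics` instead of Unity types, and `ParticleCulling` and `CollectVisible` are hypothetical names. It assumes each plane is stored as (normal.xyz, distance) with the normal pointing into the frustum.

```csharp
using System;
using System.Collections.Generic;
using System.Numerics;

static class ParticleCulling
{
	// Writes the indices of visible particles into visibleIndices and returns how many were written.
	// A particle is rejected as soon as its bounding sphere lies fully behind any plane.
	public static int CollectVisible(
		IReadOnlyList<Vector3> positions, float radius,
		IReadOnlyList<Vector4> planes, int[] visibleIndices)
	{
		int visibleCount = 0;
		for (int i = 0; i < positions.Count; i++)
		{
			bool visible = true;
			for (int p = 0; p < planes.Count; p++)
			{
				Vector4 plane = planes[p];
				var normal = new Vector3(plane.X, plane.Y, plane.Z);
				float signedDistance = Vector3.Dot(normal, positions[i]) + plane.W;
				if (signedDistance < -radius)
				{
					visible = false;
					break;
				}
			}
			if (visible)
				visibleIndices[visibleCount++] = i;
		}
		return visibleCount;
	}
}
```

In a procedural instancing setup, the returned count would become the instance count of the draw, and the index list would be uploaded so the shader can look up the right particle. Whether this pays off depends on how many particles are actually off-screen, since the test itself costs CPU time.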


Summary

I compared six ways to render 90 000 CPU-simulated particles in Unity 6 with URP, tested in Unity 6000.3.11f1.

The same particle simulation runs in all tests (~1.81 ms on its own). The rendering method makes the biggest difference.


This is the summary:

| Rendering method | Median CPU time | Particle update time | FinishFrameRendering + GPU Resident Drawer | Wait for GPU (median) | Median GPU time | Main bottleneck |
| --- | --- | --- | --- | --- | --- | --- |
| GameObjects | 81.76 ms | 22.66 ms | 31.93 ms + 0 ms | 20.75 ms | 8.56 ms | Main thread (draw call prep + transform updates) |
| GameObjects + GPU Resident Drawer | 25.79 ms | 9.78 ms | 2.52 ms + 8.30 ms | 0 ms | 5.05 ms | Main thread (GPU Resident Drawer + transforms) |
| GameObjects + GPU Resident Drawer + Transform Access Array | 18.91 ms | 3.41 ms | 2.61 ms + 8.24 ms | 0 ms | 5.05 ms | Main thread (GPU Resident Drawer) |
| Graphics.RenderMesh | 92.53 ms | 28.38 ms | 51.77 ms + 0 ms | 0 ms | 31.34 ms | Main thread (270k immediate-mode draw calls) |
| Graphics.RenderMeshInstanced | 9.92 ms | 4.44 ms | 0.93 ms + 0 ms | 3.52 ms | 4.78 ms | Render thread (batch prep stalls CPU) |
| Graphics.RenderMeshPrimitives | 3.40 ms | 2.12 ms | 0.72 ms + 0 ms | 0.24 ms | 3.02 ms | GPU (small triangles; CPU/GPU balanced) |
| Batch Renderer Group | 5.43 ms | 3.65 ms | 0.90 ms + 0 ms | 0 ms | 3.93 ms | Main thread + render thread (matrix upload) |


Key takeaways for rendering a lot of instances of the same mesh:

  • Avoid one GameObject per instance unless you enable GPU Resident Drawer. Even then, transform updates remain costly unless you use Transform Access Array jobs. GPU Resident Drawer itself also becomes costly on the main thread when many renderer instances are present.

  • Graphics.RenderMesh per particle is the worst option, worse than GameObjects, because Unity creates and clears immediate-mode renderers every frame and is not able to minimize the GPU state switches between the draw calls.

  • Graphics.RenderMeshInstanced is much better, but it is limited to 1023 instances per draw call and still stresses the render thread, which has to prepare the batches. Instanced rendering done this way also underutilizes the worker threads.

  • Graphics.RenderMeshPrimitives is the fastest overall for this use case: one draw call, minimal CPU overhead (~0.26 ms to issue the draw), and only 12 GPU draw calls total. It requires a custom shader to read instance data from a structured buffer.

  • Batch Renderer Group performs similarly to procedural instancing on the GPU but with far more boilerplate and 1067 internal draw calls.
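To put the 1023-instance limit from the list above into numbers, here is a small standalone sketch that splits an instance count into per-call ranges. `InstanceBatching` and `Split` are hypothetical helper names for illustration, and the limit is taken from this article rather than queried from Unity.

```csharp
using System;
using System.Collections.Generic;

static class InstanceBatching
{
	// Per-call instance limit as stated in this article.
	public const int MaxInstancesPerCall = 1023;

	// Returns (start, count) ranges that together cover all instances,
	// each range small enough for a single instanced call.
	public static List<(int start, int count)> Split(int totalInstances, int maxPerCall = MaxInstancesPerCall)
	{
		var batches = new List<(int start, int count)>();
		for (int start = 0; start < totalInstances; start += maxPerCall)
			batches.Add((start, Math.Min(maxPerCall, totalInstances - start)));
		return batches;
	}
}
```

For 90 000 particles, `InstanceBatching.Split(90000)` yields 88 ranges: 87 full ones of 1023 instances and a last one of 999. Each range would map to one Graphics.RenderMeshInstanced call per frame, which is work that Graphics.RenderMeshPrimitives avoids by issuing a single call.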

I personally prefer Graphics.RenderMeshPrimitives for direct control over data, culling, and draw submission. It is the fastest rendering method while also being very easy to implement.

Hungry for more?

Join my community for weekly discussions on performance and profiling

I write expert content on optimizing Unity games, customizing rendering pipelines, and enhancing the Unity Editor.

Copyright © 2026 Jan Mróz | Procedural Pixels
