Moving instance updates off the main thread

Sep 24, 2025

15 minutes

In the previous article (Rendering into custom texture in Unity), I created a rendering feature for character trails. Here, I'll profile its CPU-side rendering and optimize it using the Burst compiler.

___

How the feature works

This feature renders each game character's trail onto a top-down texture.

There is a HeatmapObjectRenderer component that represents the quad that is rendered into a texture:

:center-px:

Each game character has this component, making them act as brushes on the target texture:

___

Profiling

Let's profile the CPU side of the rendering and examine its performance. I will start by adding performance markers to the rendering code.

Adding profiling markers

I like to keep the markers in a class like this:

private static class Markers  
{  
  public static readonly ProfilerMarker updateInstances = new ProfilerMarker("HeatmapObjectRenderer.UpdateInstances");  
    
  // Alternative to the line above (use one or the other): build the name with nameof so it survives renames  
  // public static readonly ProfilerMarker updateInstances = new ProfilerMarker($"{nameof(HeatmapObjectRenderer)}.{nameof(HeatmapObjectRenderer.UpdateInstances)}");  
  ...  
}

I added a profiler marker for updating properties of all instances in the HeatmapRendererFeature:

public class HeatmapRendererFeature : ScriptableRendererFeature  
{  
   ...  
     
   public override void AddRenderPasses(ScriptableRenderer renderer, ref RenderingData renderingData)  
   {  
       ...
       // Added this marker  
       using (Markers.updateInstances.Auto())  
       {  
           HeatmapObjectRenderer.UpdateInstances();  
       }
       ...  
   }  
...

And profiler markers in HeatmapRenderPass:

public unsafe class HeatmapRenderPass : ScriptableRenderPass  
{  
   ...  
   public override void RecordRenderGraph(RenderGraph renderGraph, ContextContainer frameData)  
   {  
       // Added this marker for the whole method  
       using (Markers.recordRenderGraph.Auto())  
       {  
           ...
           // Added this marker  
           using (Markers.setBufferData.Auto())  
           using (var computeBuilder = renderGraph.AddComputePass<PassData>($"{nameof(HeatmapRenderPass)}_Compute", out PassData passData))  
           {  
               ...  
           }  
           ...  
             
           // Added this marker  
           using (Markers.rasterPass.Auto())  
           using (var rasterBuilder = renderGraph.AddRasterRenderPass<PassData>($"{nameof(HeatmapRenderPass)}_Raster", out PassData passData))  
           {  
               ...  
           }  
       }  
   }  
   ...

Setting up the benchmark

To benchmark the performance, I set up 1000 HeatmapObjectRenderers in the scene.

:center-px:

I ran play mode first to confirm that everything works in the editor. Then, for maximum profiling accuracy, I created a development build to measure with Unity's profiler, using these settings:

Development build enabled
Script debugging disabled - it increases timings in the build when enabled
Deep profiling disabled - it is useful only for GC allocation debugging, but completely destroys the profiling times when enabled
Scripting backend: IL2CPP
C++ Compiler Configuration: Master - provides the best compile-time optimizations, but it takes some time to compile.

:center-px:

Build settings

:center-px:

Player settings

Collecting the profiling data

I created the build, ran it, and launched the profiler to capture some frames.

:center-px:

Now I want an overview of the markers' performance. Picking frames individually gives a poor overview, so I’ll use the Profile Analyzer, an official Unity package that adds an analysis tool for data from Unity’s profiler.

So I ran the tool on my markers, and this is what I got:

:center-px:

Updating instances is the most costly operation each frame. For 1000 instances, it takes 0.48ms on average and 0.81ms at max. On an i5-10400F, that's more time than I'm comfortable with.

See, a game engine should handle a rendering task like this without a dent. If I have 1000 moving objects interacting with the ground, I'd rather spend this time on gameplay logic than on drawing quads into a texture. By my standards, updating 1000 objects should take ~0.05ms, not 0.47ms.

Let's optimize instance updates!

___

Analyzing instance updates implementation

Ok, now that I know I need to optimize instance updates, let's analyze its source code.

private void UpdateData()  
{  
	if (dataPtr == null)  
		return;
	
	// Fill in the data if it is allocated  
	dataPtr->alpha = alpha;  
	dataPtr->blendMode = (int)blendMode;  
	dataPtr->localToWorldMatrix = transform.localToWorldMatrix;  
}
	
public static void UpdateInstances()  
{  
	// Iterate through each instance  
	for (int i = 0; i < allInstances.Count; i++)  
		allInstances[i].UpdateData();  
}

This is the full source code being profiled. It iterates over each instance in the list and updates its properties.

Ok, so let's focus on the most costly operations here:

private void UpdateData()  
{  
	if (dataPtr == null)  
		return;
	
	dataPtr->alpha = alpha; // Just copy the value  
	dataPtr->blendMode = (int)blendMode; // Just copy the value  
	dataPtr->localToWorldMatrix = transform.localToWorldMatrix; // Create a local-to-world matrix  
}
public static void UpdateInstances()  
{  
	for (int i = 0; i < allInstances.Count; i++) // Iterate each element  
		allInstances[i].UpdateData();  
}

Most of that involves iterating each instance and assigning the data, with one caveat: transform.localToWorldMatrix is a property, not a field. Let's check what it does by decompiling the C# code in the IDE:

//  
// Summary:  
//     Matrix that transforms a point from local space into world space (Read Only).  
public Matrix4x4 localToWorldMatrix  
{  
	get  
	{  
		IntPtr intPtr = MarshalledUnityObject.MarshalNotNull(this);  
		if (intPtr == (IntPtr)0)  
		{  
			ThrowHelper.ThrowNullReferenceException(this);  
		}
		
		get_localToWorldMatrix_Injected(intPtr, out var ret);  
		return ret;  
	}  
}
[MethodImpl(MethodImplOptions.InternalCall)]  
private static extern void get_localToWorldMatrix_Injected(IntPtr _unity_self, out Matrix4x4 ret);

Accessing transform.localToWorldMatrix makes C# call into native engine code to produce the matrix. Going through the property and crossing into the native engine for every instance is what makes this code costly.

To optimize it, I need to make creating this matrix faster...

___

Analyzing the thread work

To gain some context, I will review what this workload looks like in the threads overview.

:center-px:

All instances are updated on the main thread, then Unity performs some actions, and finally, the instance data is utilized. This is how I interpret this profiling data:

I don't like that the main thread is updating the instances while the other threads do nothing. I would like to update the instances on another thread while Unity does Unity stuff. By offloading the updates to another thread, I should be able to remove the instance updates from the main thread completely. And this is my goal:

In this case, I would need to access transform.localToWorldMatrix on another thread. However, Unity prevents transform properties from being accessed from other threads... I will need to find a workaround for that!

___

Analyzing optimization options

These are the possible options I see:

Don't use transforms. Store the heatmap renderer in the code, not as a component in the hierarchy. It would allow me to update the position directly from the game's code. However, it would make working with this feature less convenient in the editor, and it hides the cost of instance updates.
Use the Job system and Burst compiler to access the transforms on another thread with native-compiled code. Code compiled with Burst should run faster, and executing it on another thread lets Unity do its own work in the meantime.

I prefer the second idea (jobs + burst) for a few reasons:

It allows me to use the components on the game objects, which better aligns with the Unity Editor workflow.
All the updates will be in a single place in the code, making profiling and managing the code easier.
It better separates the feature from the project's code - potentially allowing me to use it in another project.

___

Planning the optimization

I want to use the Burst compiler to optimize this. Burst compiles a subset of C# into native, highly optimized assembly, often running 5-20x faster than regular C#. The caveat is that it can't compile managed code such as classes and C# collections. My HeatmapObjectRenderer already stores its data in an unmanaged collection, so that's not an issue. However, my code uses the `transform` property, which is managed. Also, Unity doesn't allow accessing `transform` properties from another thread...

So I will search for a way to access transform data from unmanaged code on other threads.

After a while, I was able to find that there is an IJobParallelForTransformExtensions.ScheduleByRef, whose description is:

:center-px:

Looks like something that I will use - a Job that allows multithreaded unmanaged access to transforms - Perfect!

By the documentation, I need to create a TransformAccessArray object, which stores all the transforms I need to access.

However, the TransformAccessArray documentation is not very descriptive...

:center-px:

Thanks Unity, that's some awesome documentation... XD

Examining the fields, methods, and name suggests it is likely an array or list internally, allowing elements to be added with `Add` and removed with `RemoveAtSwapBack`.

:center-px:

I also keep my instances in a list already, so if I keep the TransformAccessArray in sync with the instances, I will be able to create a Job that will update instances on another thread.
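To illustrate why swap-back removal keeps parallel collections aligned, here is a minimal sketch in plain C#, without any Unity types. The `SwapBackDemo` class and its `RemoveAtSwapBack` helper are hypothetical stand-ins written for this example. They only mimic the behavior I assume from the method name: the last element is moved into the freed slot.

```csharp
using System;
using System.Collections.Generic;

public static class SwapBackDemo
{
    // Hypothetical helper: overwrite the removed slot with the last element,
    // then drop the last slot. O(1), but it does not preserve element order.
    public static void RemoveAtSwapBack<T>(List<T> list, int index)
    {
        int last = list.Count - 1;
        list[index] = list[last];
        list.RemoveAt(last);
    }

    public static void Main()
    {
        // Two parallel lists: element i in one belongs to element i in the other.
        var instances = new List<string> { "A", "B", "C", "D" };
        var transforms = new List<int> { 10, 20, 30, 40 };

        // Remove "B" from both lists using the same index and the same strategy.
        int index = instances.IndexOf("B");
        RemoveAtSwapBack(instances, index);
        RemoveAtSwapBack(transforms, index);

        // "D" and 40 both moved into slot 1, so the pairs still match.
        Console.WriteLine(string.Join(",", instances));  // A,D,C
        Console.WriteLine(string.Join(",", transforms)); // 10,40,30
    }
}
```

As long as every collection is modified with the same index and the same removal strategy, element i in one collection keeps referring to element i in the others.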

So my implementation plan looks like:

Store transforms of all renderer instances in one TransformAccessArray.
When updating all transforms, start an IJobParallelForTransform using the array. The job will update the data for unmanaged renderers.
Before using the renderer's data, ensure the job is completed.

___

Optimization

Now I will go through each implementation step.

1. Store transforms in one TransformAccessArray

Currently, all the managed renderer components are stored in a managed list, and their data in an unmanaged list:

// Collections that track all the renderers. Elements IDs must be in sync.  
private static List<HeatmapObjectRenderer> allInstances = new();  
private static UnsafePtrList<HeatmapObjectRendererData>* AllInstanceData = null;

So I need to add a TransformAccessArray object that will store all the transforms!

private static List<HeatmapObjectRenderer> allInstances = new();  
private static TransformAccessArray allInstanceTransforms = default; // Added this one  
private static UnsafePtrList<HeatmapObjectRendererData>* AllInstanceData = null;

Let's use that in the code. I marked the lines I added or modified with comments.

public unsafe class HeatmapObjectRenderer : MonoBehaviour  
{  
	private static List<HeatmapObjectRenderer> allInstances = new();  
	// Added the collection of transforms  
	private static TransformAccessArray allInstanceTransforms = default;  
	private static UnsafePtrList<HeatmapObjectRendererData>* AllInstanceData = null;
	
	...
	
	private void OnEnable()  
	{  
		...
		
		if (AllInstanceData == null)  
		{  
			AllInstanceData = UnsafePtrList<HeatmapObjectRendererData>.Create(1024, Allocator.Persistent);  
			
			// Added lazy initialization of transform access array  
			allInstanceTransforms = new TransformAccessArray(1024);  
		}
		
		allInstances.Add(this);  
		allInstanceTransforms.Add(transform); // Adding transform to the array  
		AllInstanceData->Add(dataPtr);  
	}
	
	private void OnDisable()  
	{  
		// I modified the code that removes elements to always use the RemoveAtSwapBack method in all collections, keeping them all in sync.  
		int index = allInstances.IndexOf(this);  
		allInstanceTransforms.RemoveAtSwapBack(index);  
		allInstances.RemoveAtSwapBack(index);  
		AllInstanceData->RemoveAtSwapBack(index);
		
		...
		
		if (AllInstanceData->Length <= 0)  
		{  
			UnsafePtrList<HeatmapObjectRendererData>.Destroy(AllInstanceData);  
			AllInstanceData = null;  
			
			// Lazy-disposing of the transform array  
			allInstanceTransforms.Dispose();  
			allInstanceTransforms = default;  
		}  
	}  
...

Now I have a synced list of instances and their transforms. Time to write a job.

2. A job that updates the matrices

I will begin by writing the code for the job. It will iterate over each transform and update the unmanaged renderer data. This is the code I got:

[BurstCompile(OptimizeFor = OptimizeFor.Performance)] // Ensure burst is optimizing this for performance  
public struct UpdateInstanceMatricesJob : IJobParallelForTransform  
{  
	[NativeDisableUnsafePtrRestriction]  
	public UnsafePtrList<HeatmapObjectRendererData>* data;
	
	public UpdateInstanceMatricesJob(UnsafePtrList<HeatmapObjectRendererData>* data)  
	{  
		this.data = data;  
	}
	
	// This method is executed for each element in the array.  
	public void Execute(int index, TransformAccess transform)  
	{  
		data->ElementAt(index)->localToWorldMatrix = transform.localToWorldMatrix;  
	}  
}
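A job like this can run in parallel without locks because each Execute call writes only to the slot at its own index. Here is a minimal sketch of that idea in plain .NET, where `Parallel.For` stands in for the job system. The `IndexedParallelWriteDemo` class and the doubling operation are made up for illustration and are not part of the feature.

```csharp
using System;
using System.Threading.Tasks;

public static class IndexedParallelWriteDemo
{
    // Each iteration reads input[index] and writes results[index] only,
    // so iterations never touch the same memory and need no synchronization.
    public static float[] ComputeAll(float[] input)
    {
        var results = new float[input.Length];
        Parallel.For(0, input.Length, index =>
        {
            results[index] = input[index] * 2f; // stand-in for the per-transform work
        });
        return results;
    }

    public static void Main()
    {
        float[] results = ComputeAll(new float[] { 1f, 2f, 3f });
        Console.WriteLine(string.Join(",", results)); // 2,4,6
    }
}
```

The same reasoning is why the element order of the transform array and the data list must match: the index is the only link between the two.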

Then, I will make the UpdateData function of the renderer NOT update the matrix:

private void UpdateData()  
{  
	if (dataPtr == null)  
		return;
	
	// Fill the data if it is allocated  
	dataPtr->alpha = alpha;  
	dataPtr->blendMode = (int)blendMode;  
}

Then, I need to start a job that updates the matrices. I will create a job and schedule it to start immediately.

// Store the JobHandle  
public static JobHandle updateMatricesJobHandle = default;
	
public static void UpdateInstances()  
{  
	// I removed updating the properties in the managed code here.  
	
	// Update matrices using a job  
	if (allInstances.Count > 0)  
	{  
		UpdateInstanceMatricesJob job = new UpdateInstanceMatricesJob(AllInstanceData); // Create a job  
		updateMatricesJobHandle = job.ScheduleByRef(allInstanceTransforms); // Schedule the job  
		JobHandle.ScheduleBatchedJobs(); // Ensure the job starts immediately  
	}  
}

After the above method is called, the updating starts on another thread, so I need to have a way to ensure it is completed. This method will do:

private static void EnsureMatricesUpdated()  
{  
	// Checking if job handle was created  
	if (!updateMatricesJobHandle.Equals(default))  
	{
		// Ensure that the waiting for another thread is properly profiled  
		using (Markers.waitingForInstancesToUpdate.Auto())  
		{
			// Wait for the job to complete.  
			updateMatricesJobHandle.Complete();  
			
			// Forget the job handle.  
			updateMatricesJobHandle = default;  
		}
	}
}

During the rendering, I fetch the data, so I need to ensure that the job is completed before the data is fetched. I will use the method I prepared above.

public static void FetchInstanceData(UnsafeList<HeatmapObjectRendererData>* targetListPtr)  
{  
	// Ensure the update job is completed before accessing the data  
	EnsureMatricesUpdated();
	
	for (int i = 0; i < AllInstanceData->Length; i++)  
		targetListPtr->Add(*(AllInstanceData->ElementAt(i)));  
}
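The overall shape here is "schedule early, complete late": start the work, let the main thread do other things, and block only at the point where the result is needed. Here is a rough analogy in plain .NET, using a `Task` in the role of the JobHandle. This is not Unity's job system, and the `ScheduleThenCompleteDemo` class and its data are hypothetical.

```csharp
using System;
using System.Threading.Tasks;

public static class ScheduleThenCompleteDemo
{
    private static Task pendingUpdate;                      // plays the role of the JobHandle
    private static readonly float[] data = new float[1000]; // plays the role of the instance data

    public static void UpdateInstances()
    {
        // Start the work on a thread-pool thread and return immediately.
        pendingUpdate = Task.Run(() =>
        {
            for (int i = 0; i < data.Length; i++)
                data[i] = i * 2f;
        });
    }

    public static float[] FetchInstanceData()
    {
        // Wait only if an update was started, then forget the handle.
        if (pendingUpdate != null)
        {
            pendingUpdate.Wait();
            pendingUpdate = null;
        }
        return data;
    }

    public static void Main()
    {
        UpdateInstances();
        // ... other main-thread work would happen here ...
        float[] result = FetchInstanceData();
        Console.WriteLine(result[999]); // 1998
    }
}
```

The later the completion point sits in the frame, the more likely the background work has already finished by the time it is reached, so the wait costs almost nothing.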

3. Fixing code issues

Ok, UpdateInstances() no longer calls UpdateData() at all, because the matrices are now set on another thread. However, that method also updated the alpha and blendMode fields of the unmanaged renderer data, so I need to make sure those still get updated after this change. I modified UpdateData() to update only those two fields:

private void UpdateData()  
{  
	if (dataPtr == null)  
		return;
	
	// Fill the data if it is allocated. Don't update the matrix here.  
	dataPtr->alpha = alpha;  
	dataPtr->blendMode = (int)blendMode;  
}

Then I need to ensure that this is called each time someone modifies the managed fields.

So I changed this:

[Range(0.0f, 1.0f)] public float alpha;  
public HeatmapObjectBlendMode blendMode;

Into this:

[Range(0.0f, 1.0f)] [SerializeField] private float alpha;  
[SerializeField] private HeatmapObjectBlendMode blendMode;
// Alpha is a property now  
public float Alpha  
{  
   get => alpha;  
   set { alpha = value; UpdateData(); } // When set, it updates internal unmanaged data  
}
// BlendMode is a property now  
public HeatmapObjectBlendMode BlendMode  
{  
   get => blendMode;  
   set { blendMode = value; UpdateData(); } // When set, it updates internal unmanaged data  
}

I also ensured that the data is properly updated when those values are modified through the inspector window:

private void OnValidate()  
{  
	UpdateData();  
}

___

Second profiling round

Now that the code is ready, it's time to profile! I ensured that the feature works correctly in the editor and then created a build using the same settings as in the previous build.

:center-px:

So, the median time of updating 1000 instances on the main thread dropped from 0.47ms to 0.02ms. That is expected, because I moved the whole workload to another thread.

Let's see the other thread:

:center-px:

Work happens on another thread.

0.08ms median time and 0.27ms max on another thread sounds much better than 0.47ms median and 0.81ms max on the main thread.

And my implementation doesn't stall the main thread at all.

:center-px:

No main thread stalls.

This optimization enables me to render many more objects that interact with the ground. When I tried 10,000 objects, the performance in the build remained smooth.

:center-px:

CPU performance when rendering 10,000 objects. Median time was ~0.32ms per frame, mainly spent copying the data into a graphics buffer.

And the GPU performance when rendering 10,000 objects:

:center-px:

0.35ms to render 10,000 instanced quads into a 4K texture on an RTX 3060.

___

Summary

The initial implementation of heatmap rendering suffered from costly per-instance updates, with transform.localToWorldMatrix calls dominating CPU time on the main thread (0.47ms median for 1000 instances). By moving the matrix updates into a Burst-compiled IJobParallelForTransform, I shifted the workload off the main thread: the main-thread cost dropped to 0.02ms, and the job itself takes 0.08ms (median) on another thread.

Now it is a nice and scalable solution for rendering quads into a texture.

___