Tutorial
Moving instance updates off the main thread
Sep 24, 2025
15 minutes
In the previous article (Rendering into custom texture in Unity), I created a rendering feature for character trails. Here, I'll profile and optimize its CPU-side rendering using the Burst compiler.
___
How the feature works
This feature renders each game character's trail onto a top-down texture.
There is a `HeatmapObjectRenderer` component that represents the quad that is rendered into a texture:
:center-px:

:image-description:
Each game character has this component, making them act as brushes on the target texture:
___
Profiling
Let's profile the CPU side of the rendering and examine its performance. I will start by adding performance markers to the rendering code.
Adding profiling markers
I like to keep the markers in a class like this:
I added a profiler marker for updating properties of all instances in the `HeatmapRendererFeature`:
And profiler markers in the `HeatmapRenderPass`:
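The listings themselves are not reproduced here. As a rough sketch of the pattern, assuming hypothetical marker names (`Heatmap.UpdateInstances`, `Heatmap.ExecutePass`) and a hypothetical helper class, a shared marker class and its usage could look like this:

```csharp
using Unity.Profiling;

// Hypothetical marker names; not the article's actual listing.
public static class HeatmapProfilerMarkers
{
    public static readonly ProfilerMarker UpdateInstances =
        new ProfilerMarker("Heatmap.UpdateInstances");

    public static readonly ProfilerMarker ExecutePass =
        new ProfilerMarker("Heatmap.ExecutePass");
}

public static class HeatmapProfilingExample
{
    public static void UpdateAllInstances()
    {
        // Auto() returns a disposable scope: the marker begins here
        // and ends when the using block exits.
        using (HeatmapProfilerMarkers.UpdateInstances.Auto())
        {
            // ... per-instance update work goes here ...
        }
    }
}
```

Keeping the markers as static readonly fields means each one is created once and reused every frame, and the names show up as separate entries in the Profiler and the Profile Analyzer.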
Setting up the benchmark
To benchmark the performance, I set up 1000 `HeatmapObjectRenderer` instances in the scene.
:center-px:

I will measure performance using Unity's Profiler. After confirming in play mode that everything works in the editor, I created a development build for maximum profiling accuracy with these settings:
Development build enabled
Script debugging disabled - it increases timings in the build when enabled
Deep profiling disabled - it is useful only for GC allocation debugging, but completely destroys the profiling times when enabled
Scripting backend: IL2CPP
C++ Compiler Configuration: Master - provides the best compile-time optimizations, but it takes some time to compile.
:center-px:

:image-description:
Build settings
:center-px:

:image-description:
Player settings
Collecting the profiling data
I created the build, ran it, and launched a profiler to collect the data. I collected some frames.
:center-px:

Now I want an overview of the markers' performance. Picking frames individually gives a poor overview, so I’ll use the Profile Analyzer, an official Unity package that adds an analysis tool for data from Unity’s profiler.

So I ran the tool on my markers, and this is what I got:
:center-px:

Updating instances is the most costly operation each frame. For 1000 instances, it takes 0.48ms on average and 0.81ms at max. On an i5-10400F, that's more time than I'm comfortable with.
See, game engines should handle rendering tasks like that without a dent. If I want to have 1000 moving objects interacting with the ground, I'd rather spend this time on gameplay logic, not on drawing quads into a texture. By my standard, updating 1000 objects should take ~0.05ms, not the 0.47ms median measured here.
Let's optimize instance updates!
___
Analyzing instance updates implementation
Ok, now that I know I need to optimize instance updates, let's analyze their source code.
This is the fully profiled source code. It iterates over each instance in the list and updates its properties.
Ok, so let's focus on the most costly operations here:
Most of that involves iterating over each instance and assigning the data, with one caveat: `transform.localToWorldMatrix` is a property, not a field. Let's check what it does by decompiling the C# code in the IDE:
Accessing `transform.localToWorldMatrix` makes C# call into native engine code to produce the matrix. Going through a property getter and crossing into the native engine on every access is what makes this code costly.
To optimize it, I need to make creating this matrix faster...
___
Analyzing the thread work
To gain some context, I will review what this workload looks like in the threads overview.
:center-px:

All instances are updated on the main thread, then Unity performs some actions, and finally, the instance data is utilized. This is how I interpret this profiling data:

I don't like that the main thread is updating the instances while other threads do nothing. I would like to update the instances on another thread while Unity does Unity stuff. By offloading the updates to another thread, I should be able to remove the instance updates from the main thread completely. This is my goal:

In this case, I would need to access `transform.localToWorldMatrix` on another thread. However, Unity prevents the transform properties from being accessed from other threads... I will need to find a workaround for that!
___
Analyzing optimization options
These are the options I see:
Don't use transforms. Store the heatmap renderer in the code, not as a component in the hierarchy. It would allow me to update the position directly from the game's code. However, it would make working with this feature less convenient in the editor, and it hides the cost of instance updates.
Or I could use the Job System and the Burst compiler to access the transforms on another thread with native-compiled code. Code compiled with Burst runs faster, and executing it on another thread lets Unity do its own work on the main thread in the meantime.
I prefer the second idea (jobs + burst) for a few reasons:
It allows me to use the components on the game objects, which better aligns with the Unity Editor workflow.
All the updates will be in a single place in the code, making profiling and managing the code easier.
It better separates the feature from the project's code - potentially allowing me to use it in another project.
___
Planning the optimization
I want to use the Burst compiler to optimize this. Burst compiles a subset of C# into native, highly optimized assembly, often running 5-20x faster than C#. The caveat is that it can't compile code that uses managed types, such as classes and the standard C# collections. My `HeatmapObjectRenderer` already stores its data in an unmanaged collection, so that's not an issue. However, my code uses the `transform` property, which is managed. Also, Unity doesn't allow accessing `transform` properties from another thread...
So I will search for a way to access transform data in unmanaged code from other threads.
After a while, I found that there is an `IJobParallelForTransformExtensions.ScheduleByRef`, whose description is:
:center-px:

:image-description:
Looks like something that I will use - a Job that allows multithreaded unmanaged access to transforms - Perfect!
According to the documentation, I need to create a `TransformAccessArray` object, which stores all the transforms I need to access.

However, the `TransformAccessArray` documentation is not very descriptive...
:center-px:

:image-description:
Thanks Unity, that's awesome documentation... XD
Examining the fields, methods, and name suggests it is likely an array or list internally, allowing elements to be added with `Add` and removed with `RemoveAtSwapBack`.
:center-px:

I also keep my instances in a list already, so if I keep the `TransformAccessArray` in sync with the instances, I will be able to create a job that updates the instances on another thread.
So my implementation plan looks like:
Store the transforms of all renderer instances in one `TransformAccessArray`.
When updating all transforms, start an `IJobParallelForTransform` using the array. The job will update the data for unmanaged renderers.
Before using the renderers' data, ensure the job is completed.
___
Optimization
Now I will go through each implementation step.
1. Store transforms in one TransformAccessArray
Currently, all the managed renderer components are stored in a managed list, and their data in an unmanaged list:
So I need to add a `TransformAccessArray` object that will store all the transforms!
Let's use that in the code. I marked the lines I modified with comments.
Now I have a synced list of instances and their transforms. Time to write a job.
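For reference, the syncing described in this step can be sketched as follows. This is only an illustration under assumptions: `TransformRegistry` and its members are hypothetical names, and the unmanaged data list mentioned above is left out (it would follow the same swap-back pattern). `TransformAccessArray.Add` and `RemoveAtSwapBack` are the methods discussed earlier.

```csharp
using System;
using System.Collections.Generic;
using UnityEngine;
using UnityEngine.Jobs;

// Hypothetical registry; names are illustrative, not the article's code.
public sealed class TransformRegistry : IDisposable
{
    private readonly List<Component> instances = new List<Component>();
    private TransformAccessArray transforms;

    public TransformRegistry(int initialCapacity = 64)
    {
        transforms = new TransformAccessArray(initialCapacity);
    }

    public int Count => instances.Count;
    public TransformAccessArray Transforms => transforms;

    public void Register(Component instance)
    {
        // Both containers append at the end, so indices stay aligned.
        instances.Add(instance);
        transforms.Add(instance.transform);
    }

    public void Unregister(Component instance)
    {
        int index = instances.IndexOf(instance);
        if (index < 0) return;

        // Mirror RemoveAtSwapBack on the managed list: move the last
        // element into the freed slot, then drop the last slot.
        int last = instances.Count - 1;
        instances[index] = instances[last];
        instances.RemoveAt(last);
        transforms.RemoveAtSwapBack(index);
    }

    public void Dispose()
    {
        // TransformAccessArray owns native memory and must be disposed.
        if (transforms.isCreated) transforms.Dispose();
    }
}
```

The important detail is that removal uses the same swap-back order in every container, so index `i` always refers to the same renderer in the managed list and in the `TransformAccessArray`.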
2. A job that updates the matrices
I will begin by writing the code for the job. It will iterate over each transform and update the unmanaged renderer data. This is the code I got:
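The original listing is not reproduced here. A minimal sketch of such a job, assuming a hypothetical `InstanceData` layout (matrix, alpha, blend mode) and a `NativeArray` that is index-aligned with the `TransformAccessArray`, might look like this:

```csharp
using Unity.Burst;
using Unity.Collections;
using UnityEngine;
using UnityEngine.Jobs;

// Hypothetical unmanaged per-instance data; the real layout may differ.
public struct InstanceData
{
    public Matrix4x4 LocalToWorld;
    public float Alpha;
    public int BlendMode;
}

[BurstCompile]
public struct UpdateMatricesJob : IJobParallelForTransform
{
    // Assumed to have the same length and order as the TransformAccessArray.
    public NativeArray<InstanceData> Instances;

    public void Execute(int index, TransformAccess transform)
    {
        // Only the matrix is refreshed; alpha and blend mode are left as-is.
        InstanceData data = Instances[index];
        data.LocalToWorld = transform.localToWorldMatrix;
        Instances[index] = data;
    }
}
```

`TransformAccess` exposes the transform's data to the job without touching the managed `Transform` object, which is what makes this callable from a worker thread.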
Then, I will make the `UpdateData` function of the renderer NOT update the matrix:
Then, I need to start a job that updates the matrices. I will create a job and schedule it to start immediately.
After the above method is called, the updating starts on another thread, so I need to have a way to ensure it is completed. This method will do:
During the rendering, I fetch the data, so I need to ensure that the job is completed before the data is fetched. I will use the method I prepared above.
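The schedule-then-complete flow from this step can be sketched like this. `TransformJobRunner` is a hypothetical helper, not the article's code; it works with any `IJobParallelForTransform` and only uses `Schedule`, `JobHandle.ScheduleBatchedJobs`, and `JobHandle.Complete`.

```csharp
using Unity.Jobs;
using UnityEngine.Jobs;

// Hypothetical helper that owns the handle of the in-flight update job.
public sealed class TransformJobRunner
{
    private JobHandle handle;

    public void Schedule<T>(T job, TransformAccessArray transforms)
        where T : struct, IJobParallelForTransform
    {
        // Never leave a previous job running against the same data.
        handle.Complete();

        handle = job.Schedule(transforms);

        // Ask the job system to start queued jobs now instead of
        // waiting for the next internal flush.
        JobHandle.ScheduleBatchedJobs();
    }

    public void EnsureCompleted()
    {
        // Blocks only if the job is still running; completing a
        // default or already finished handle returns immediately.
        handle.Complete();
    }
}
```

In this arrangement `Schedule` would be called once per frame where the instances used to be updated, and `EnsureCompleted` right before the render pass reads the instance data.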
3. Fixing code issues
Ok, I removed the usage of the `UpdateData()` method completely, because matrices are set on another thread. However, I also used this method to update the `alpha` and `blendMode` fields of the unmanaged renderer data. I need to ensure that those are still updated correctly after the code changes. I modified the `UpdateData()` method to update only those two fields:
Then I need to ensure that this is called each time someone modifies the managed fields.
So I changed this:
Into this:
I also ensured that the data is properly updated when those values are modified through the inspector window:
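The before and after listings are not shown here. One common way to get this behavior, sketched under assumptions (the component name, the `UnmanagedAlpha` stand-in for the unmanaged copy, and the single `alpha` field are all hypothetical), is a property setter plus `OnValidate`:

```csharp
using UnityEngine;

// Hypothetical component; field and method names are illustrative.
public class HeatmapBrushExample : MonoBehaviour
{
    [SerializeField, Range(0f, 1f)] private float alpha = 1f;

    // Stand-in for the value stored in the unmanaged instance data.
    public float UnmanagedAlpha { get; private set; }

    public float Alpha
    {
        get => alpha;
        set
        {
            alpha = value;
            UpdateData(); // push the change to the unmanaged copy
        }
    }

    // Editor-only callback: runs when the script is loaded or a
    // serialized value changes in the inspector.
    private void OnValidate() => UpdateData();

    private void UpdateData()
    {
        // In the real feature this would write alpha and blend mode
        // into the unmanaged renderer data.
        UnmanagedAlpha = alpha;
    }
}
```

Routing every write through the setter keeps the managed field and the unmanaged copy from drifting apart, without polling for changes each frame.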
___
Second profiling round
Now that the code is ready, it's time to profile! I ensured that the feature works correctly in the editor and then created a build using the same settings as in the previous build.
:center-px:

So, the median time of updating 1000 instances dropped from **0.47ms to 0.02ms** on the main thread. This is expected, because I moved the whole workload to another thread.
Let's see the other thread:
:center-px:

Work happens on another thread.
0.08ms median time and 0.27ms max on another thread sounds much better than 0.47ms median and 0.81ms max on the main thread.
And my implementation doesn't stall the main thread at all.
:center-px:

No main thread stalls.
This optimization enables me to render many more objects that interact with the ground. When I tried 10,000 objects, the performance in the build remained smooth.

:center-px:

Performance of the CPU when rendering 10000 objects. Median time was ~0.32ms per frame, mainly copying the data into a graphics buffer.
And performance of the GPU of rendering 10000 objects:
:center-px:

0.35ms to render 10000 instanced quads into a 4K texture on an RTX 3060.
___
Summary
The initial implementation of heatmap rendering suffered from costly per-instance updates, with `transform.localToWorldMatrix` calls dominating CPU time on the main thread (0.47ms median for 1000 instances). By moving matrix updates into a Burst-compiled `IJobParallelForTransform`, the workload was shifted off the main thread, reducing the main-thread cost to 0.02ms, with the job itself taking 0.08ms (median) on another thread.
Now it is a nice and scalable solution for rendering quads into a texture.