I recently saw a review of the Doom 3 BFG Edition source code. In particular, the multithreaded job system caught my attention because it’s something I’ve written in the past, and I wanted to revisit it for my current game projects. So naturally, I had to roll another one, using an approach similar to id Tech 5’s, which puts a strong emphasis on avoiding locks. Keep in mind that in many scenarios, the performance loss from locking may be either nonexistent or not worth the hassle of writing lockless code.
I’d also read about the LMAX Disruptor last year, the architecture behind a high-traffic financial platform. Its novelty was heavily optimizing a single business-logic thread to run all requests, rather than splitting the work among separate threads and synchronizing. The system moved data to and from the business-logic thread using ring-buffer pipelines. These are great for moving data across threads without involving locks, and I’ve made generous use of them in my design. LMAX also included an event-sourcing model to support failover scenarios, and the model of a single thread processing requests serially works well for deterministic behavior and replays.
For anyone unfamiliar with threading, locking is a way of ensuring that only a single thread of execution runs in a block of code (a “critical section”) at any given time. Locking involves calls to the operating system and causes other threads to halt execution if they attempt to enter the critical section while another thread claims it. Lockless approaches, on the other hand, try to keep threads running by writing to variables using operations that are atomic at the hardware level. When the write attempt fails, the thread may “spin” and try again, or it may go off and perform some other work. This is great for avoiding blocking, but unfortunately lockless approaches can quickly become nontrivial and difficult to understand. One approach or the other may be more appropriate depending on the expected level of contention over shared resources.
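As a concrete illustration of the spin-and-retry pattern, here is a minimal lock-free counter. It’s a hypothetical example in Java so it can stand alone; in C#, `Interlocked.CompareExchange` plays the same role as `compareAndSet`.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical lock-free counter: each thread "spins" on compare-and-set
// instead of blocking on an OS lock.
class SpinCounter {
    private final AtomicLong value = new AtomicLong();

    long increment() {
        while (true) {
            long current = value.get();
            long next = current + 1;
            // Atomic at the hardware level: succeeds only if no other
            // thread changed the value since we read it.
            if (value.compareAndSet(current, next)) {
                return next;
            }
            // CAS failed: another thread won the race, so spin and retry.
        }
    }

    long get() { return value.get(); }
}
```

No thread ever blocks here, but under heavy contention each failed CAS is wasted work, which is exactly the trade-off against locking mentioned above.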
Besides contention, another big concern with multithreaded programming is passing data from one thread to another. Immutable objects are great because we can pass them around without ever worrying about how they’ll be used, but depending on garbage collection concerns and object usage, mutability may be more appropriate. For example, we may have large buffers passed around and employ pooling to reuse the space to avoid memory allocations or garbage collection. In this case, we need to consider whether there are other references to the object and decide whether or not to transfer ownership, and thus responsibility for destroying or recycling the object, to the called method. We could also clone the object and guarantee that nothing else references the new object, but this may have its own overhead depending on the cost of copying.
In C#, an obvious choice is to use the Task Parallel Library and TPL Dataflow, but I needed a solution that works in Unity3D, which is still at the equivalent of .NET 3.5 and does not include these APIs (although I’ve seen TPL backported to 3.5). I’m also slightly concerned about how a large number of Task objects (which cannot be reused) would affect garbage collection times.
We want to avoid processing spikes on the main thread that cause frames to be dropped. One way to mitigate spikes is to smooth them out by limiting the amount of work allowed in a single frame. This requires that the task can be partially completed, and continuations are a convenient means of creating a state machine to support this. Another way is to offload tasks to background threads, and the purpose of the job/task system is to make that easy.
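One simple form of that state machine is a job that remembers a cursor and accepts a per-frame work budget. A sketch (the `SlicedJob`/`FillBuffer` names are hypothetical, and it’s written in Java rather than C# so it can stand alone):

```java
// Hypothetical time-sliced job: runs at most `budget` units of work per
// call and reports whether it finished, so a long task can be smoothed
// across several frames instead of spiking one.
interface SlicedJob {
    /** Runs up to `budget` units of work; returns true when finished. */
    boolean step(int budget);
}

class FillBuffer implements SlicedJob {
    private final int[] buffer;
    private int cursor; // the state machine: where to resume next frame

    FillBuffer(int size) { buffer = new int[size]; }

    public boolean step(int budget) {
        int end = Math.min(cursor + budget, buffer.length);
        for (; cursor < end; cursor++) {
            buffer[cursor] = cursor * cursor; // placeholder work
        }
        return cursor == buffer.length;
    }
}
```

The main loop would call `step(budget)` once per frame until it returns true, capping the work done in any single frame.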
The basic steps:
- Partition app/game into subsystems that support N threads in parallel with little or no synchronization. For example, a map generation subsystem might be stateless and support any number of threads, but a world simulation subsystem may have many complex internal interactions and only support a single thread at a time.
- On the main thread, queue up jobs for each subsystem. While this step only involves appending a job callback and argument to a list, careful thought needs to go into the ownership of any mutable arguments passed to the job system.
- Once per frame on the main thread, submit the batch of queued jobs to another set of shared pending queues accessible by the background worker threads.
- As new batches become available, the worker threads copy them into their own thread-local storage, and from there the jobs can be ordered, traversed, and selected freely without worrying about modifying a shared collection of jobs.
- While jobs remain unfinished, the worker threads attempt to acquire a lock on a particular job (using atomic primitives), and if the job is already taken, they continue on to search for another.
- Once a job is obtained, the worker thread runs it and posts its output value to an output queue readable by the main thread.
- When no jobs remain in local storage or in the pending queues, the worker threads sleep and wait for a signal to notify them that jobs are available again.
- Once per frame on the main thread, output queue entries are consumed and completion callbacks are called if the job was not canceled.
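The per-job “lock” in the steps above can be sketched as a compare-and-set flag per job. This is a hypothetical `JobBatch` in Java for a self-contained example; the C# equivalent would use `Interlocked.CompareExchange` on an int array.

```java
import java.util.concurrent.atomic.AtomicIntegerArray;

// Hypothetical claim step from the worker loop: each job slot holds
// 0 (open) or 1 (taken). A worker claims a job by atomically flipping
// the flag, and moves on if another worker got there first.
class JobBatch {
    private final Runnable[] jobs;
    private final AtomicIntegerArray taken;

    JobBatch(Runnable[] jobs) {
        this.jobs = jobs;
        this.taken = new AtomicIntegerArray(jobs.length);
    }

    /** Tries each job once and runs the first one it wins; returns true if a job ran. */
    boolean runOne() {
        for (int i = 0; i < jobs.length; i++) {
            // compareAndSet is the atomic "lock" on a single job: only one
            // worker can move the flag from 0 to 1.
            if (taken.compareAndSet(i, 0, 1)) {
                jobs[i].run();
                return true;
            }
        }
        return false; // nothing left to claim in this batch
    }
}
```

A losing worker never blocks; it simply scans forward for the next unclaimed job, which is the whole point of using atomics instead of a lock around the batch.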
Many of the queues are implemented as a ring buffer with an overflow list. Ring buffers are allocated at a fixed power-of-two size and have head and tail pointers, with simple atomic operations to enqueue and dequeue that support concurrent access by a single producer and a single consumer. Once capacity is reached, entries go into the overflow list, which does involve OS locks. Alternatively, this could be implemented with a resizing buffer, but the code would be more complex, and the intent is that each ring buffer’s size is chosen large enough to avoid reaching capacity under nominal loads.
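Under the single-producer/single-consumer constraint, enqueue and dequeue reduce to an index mask and two monotonically increasing counters, each written by exactly one thread. A sketch of that core (overflow list omitted; Java, with `AtomicLong` get/set standing in for the volatile reads and publishing writes a C# version would use):

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of a single-producer/single-consumer ring buffer over a fixed
// power-of-two array. Because each counter has exactly one writer, plain
// atomic loads and stores suffice; no locks and no CAS loops are needed.
class SpscRingBuffer<T> {
    private final Object[] slots;
    private final int mask;  // capacity - 1; works because capacity is a power of two
    private final AtomicLong head = new AtomicLong(); // next slot to read (consumer writes)
    private final AtomicLong tail = new AtomicLong(); // next slot to write (producer writes)

    SpscRingBuffer(int capacityPowerOfTwo) {
        slots = new Object[capacityPowerOfTwo];
        mask = capacityPowerOfTwo - 1;
    }

    /** Producer thread only. Returns false when full (caller would overflow). */
    boolean offer(T item) {
        long t = tail.get();
        if (t - head.get() == slots.length) return false; // full
        slots[(int) (t & mask)] = item;
        tail.set(t + 1); // publish only after the slot is written
        return true;
    }

    /** Consumer thread only. Returns null when empty. */
    @SuppressWarnings("unchecked")
    T poll() {
        long h = head.get();
        if (h == tail.get()) return null; // empty
        int index = (int) (h & mask);
        T item = (T) slots[index];
        slots[index] = null; // release the reference for garbage collection
        head.set(h + 1);
        return item;
    }
}
```

When `offer` returns false, the caller would fall back to the lock-protected overflow list described above.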
There’s plenty more to talk about, but I’ll save it for another post.