For the last few weeks things might have appeared quiet with Cantabile 3, but in fact I’ve been really busy. Mostly with GUI work for the OS X port (which I’ll write about later), but also with performance tuning the audio engine for multi-core processors.

The Challenge

Getting Cantabile’s audio engine to run efficiently on multi-core machines is a challenging problem. At its heart the audio engine is a connected object graph — a giant tree of components wired together to produce whatever it is that the user has configured in the session. These components include things like audio mixers, midi muxers, peak level readers, media players, plugin hosting components and many others.

Each object in the graph has a set of precedent objects — objects that must be executed before the dependent object can be executed. For example, the audio input mixer for a plugin must be executed before the plugin itself can be processed.
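
To make that concrete, here’s a minimal sketch of what such a graph node might look like. The names and fields are illustrative, not Cantabile’s actual classes:

```cpp
#include <atomic>
#include <vector>

// An illustrative graph node: each object knows its dependants and counts
// how many of its precedents haven't executed yet this cycle.
struct GraphObject
{
    std::vector<GraphObject*> dependants;   // objects waiting on this one
    std::atomic<int> pendingPrecedents{0};  // reset at the start of each cycle
    int weight = 1;                         // becomes relevant later in this post

    virtual ~GraphObject() = default;
    virtual void Execute() = 0;             // perform this object's audio work
};
```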

Up until now, Cantabile 3’s approach to executing this graph was to put all objects with no precedents into a queue and wake up all the worker threads, each of which would then (see the sketch after this list):

  1. Pull one object from the queue
  2. Execute that item
  3. Put any dependants that had no further pending precedents onto the end of the queue
  4. Repeat if there was something else in the queue
  5. Otherwise, sleep for a bit, wake up, and check again for something to do
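
In code, that loop amounted to something like this. It’s a simplified sketch building on the GraphObject above; the queue, lock and engineRunning flag are stand-ins for the real engine state:

```cpp
#include <deque>
#include <mutex>
#include <thread>

std::deque<GraphObject*> workQueue;     // objects ready to execute
std::mutex queueLock;                   // guards every access to the queue
std::atomic<bool> engineRunning{true};

void OldWorkerThread()
{
    while (engineRunning)
    {
        // Step 1: pull one object (one lock acquisition per object)
        GraphObject* obj = nullptr;
        {
            std::lock_guard<std::mutex> guard(queueLock);
            if (!workQueue.empty())
            {
                obj = workQueue.front();
                workQueue.pop_front();
            }
        }

        if (obj)
        {
            // Step 2: execute it
            obj->Execute();

            // Step 3: queue dependants with no further pending precedents
            for (auto* dep : obj->dependants)
            {
                if (--dep->pendingPrecedents == 0)
                {
                    std::lock_guard<std::mutex> guard(queueLock);
                    workQueue.push_back(dep);
                }
            }
        }
        else
        {
            // Steps 4 and 5: nothing queued, so yield and check again
            std::this_thread::yield();
        }
    }
}
```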

It’s a very simple algorithm and it works, but it suffers from two significant problems — excessive locking and busy sleeping.

Excessive Locking and Lock Contention

In previous posts I’ve said that Cantabile’s audio engine is completely lock free — which, in terms of interactions with code outside the audio engine, it is. Internally however there’s one lock that’s used to manage access to the above-mentioned queue when processing in multi-threaded mode. (When multi-core mode is disabled, no lock is used.)

The problem with the above algorithm is that each access to the queue or any of the elements in the queue requires taking the lock. For small object graphs this doesn’t really present a problem, but as the object graph grows (a session with 64 plugins has about 500 objects in the graph) the locks start to consume a reasonable percentage of the audio cycle time.

Worse, lock contention issues start to arise — where multiple threads are fighting for the lock and end up stalling each other.

Busy Waiting

The other problem with the old approach is that in order to get the worker threads to respond quickly to new work placed in the queue they would “busy” wait.

That is, rather than properly sleeping and letting the operating system re-awaken them later, each thread would simply spin in a loop waiting for work to become available. It wasn’t a tight loop — it only ran while the audio cycle was being processed and it regularly yielded the time slice back to the OS — but it was still less than ideal.
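
In rough terms the idle part of the loop amounted to this (cycleInProgress and workAvailable being illustrative atomic flags, not Cantabile’s actual names):

```cpp
// Spin until work shows up, yielding the time slice on each iteration.
// Not a tight spin, but the thread stays runnable and keeps consuming CPU.
while (cycleInProgress && !workAvailable)
    std::this_thread::yield();
```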

The problem with this busy waiting is that it consumes CPU resources that could be used by multi-core plugins or other software running on the machine.

The Solution

After spending a couple of days trying (and failing) to tune these issues away, it became apparent that a new strategy was needed. The biggest problem is that every object in the graph requires taking the lock at least once just to remove it from the queue.

I toyed with the idea of using lock-free ring buffers, but these generally impose single-reader/single-writer constraints — which doesn’t fit this scenario. I also experimented with grouping parts of the object graph into sections that could be executed as a set, but this imposes an additional burden on the code that builds the object graph — a burden I’d prefer to live without. I even tried automatically generating these groupings, but this was too slow.

Finally, I came up with a new strategy that so far seems to be working:

  1. Each object in the graph is given a “weight” — a number that represents the approximate load required to execute the item. Currently plugins are given a weight of 100 while everything else gets 1.
  2. A maximum queue weight is calculated by dividing the total graph weight by the number of CPU cores.
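
In code, the budgeting might look something like this (a sketch; CalculateMaxQueueWeight and the parameter names are illustrative):

```cpp
const int PLUGIN_WEIGHT  = 100;   // plugins dominate the processing cost
const int DEFAULT_WEIGHT = 1;     // mixers, muxers, meters, etc.

int CalculateMaxQueueWeight(const std::vector<GraphObject*>& allObjects, int numCpuCores)
{
    // Total the whole graph, then give each core an equal share of the work
    int totalGraphWeight = 0;
    for (auto* obj : allObjects)
        totalGraphWeight += obj->weight;

    return totalGraphWeight / numCpuCores;
}
```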

Each worker thread then (see the sketch after this list):

  1. Takes the lock
  2. Pulls as many objects from the queue as it can without exceeding the maximum queue weight and places them into a local queue
  3. Releases the lock
  4. Executes the next object in the local queue
  5. Releases any dependent objects whose pending precedent count has reached zero.
  6. If a released dependant can fit in the current local queue without exceeding the maximum queue weight, it’s appended to the local queue and will be executed by this thread soon — possibly without ever retaking the lock.
  7. Otherwise, the lock is taken and it’s put back in the master queue and another worker thread is awoken.
  8. Repeat from 4 until there’s nothing left to do.
  9. Busy wait for a little bit, then properly sleep until woken.
  10. Repeat from 1 until the entire object graph has been executed.
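
Condensed into code, the steps above look roughly like this. Again it’s a sketch built on the earlier fragments: workQueue plays the role of the master queue, while WakeOneWorker and WaitForMoreWork stand in for the wake/sleep machinery (the latter is sketched in the next section):

```cpp
void WakeOneWorker();             // rouses a sleeping worker thread
void WaitForMoreWork();           // spin briefly, then sleep (see below)

int maxQueueWeight;               // calculated once per cycle, as above

void NewWorkerThread()
{
    std::deque<GraphObject*> localQueue;   // this thread's private work list

    while (engineRunning)
    {
        int localWeight = 0;

        // Steps 1 to 3: one lock to grab up to maxQueueWeight of work
        {
            std::lock_guard<std::mutex> guard(queueLock);
            while (!workQueue.empty())
            {
                GraphObject* obj = workQueue.front();
                if (!localQueue.empty() && localWeight + obj->weight > maxQueueWeight)
                    break;                 // leave the rest for other workers
                workQueue.pop_front();
                localQueue.push_back(obj);
                localWeight += obj->weight;
            }
        }

        // Steps 4 to 8: drain the local queue, only relocking on overflow
        while (!localQueue.empty())
        {
            GraphObject* obj = localQueue.front();
            localQueue.pop_front();
            localWeight -= obj->weight;

            obj->Execute();

            // Step 5: release dependants whose precedents are all done
            for (auto* dep : obj->dependants)
            {
                if (--dep->pendingPrecedents != 0)
                    continue;              // still waiting on other precedents

                if (localWeight + dep->weight <= maxQueueWeight)
                {
                    // Step 6: keep it local, no lock needed
                    localQueue.push_back(dep);
                    localWeight += dep->weight;
                }
                else
                {
                    // Step 7: hand it back to the master queue, wake a worker
                    std::lock_guard<std::mutex> guard(queueLock);
                    workQueue.push_back(dep);
                    WakeOneWorker();
                }
            }
        }

        // Steps 9 and 10: spin briefly, then sleep until woken
        WaitForMoreWork();
    }
}
```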

In other words, each worker thread tries to pull as much work as it can from the master queue with just one lock but not so much that there’s nothing for other worker threads to help with. This approach can also process long chains of objects without having to retake the lock — a reasonably common pattern in Cantabile’s object graphs.

The other part of the solution was to replace the busy waiting with a less aggressive algorithm that busy waits for a short period but then fairly quickly gives up and yields back to the OS. Since each awakening performs more work anyway, the cost of a full sleep/wake becomes proportionally lower.
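
A sketch of that wait, assuming a condition variable is signalled whenever new work is queued (the spin count here is arbitrary):

```cpp
#include <condition_variable>

std::condition_variable workSignal;       // notified when work is queued
std::atomic<bool> workAvailable{false};   // set alongside the notification

void WaitForMoreWork()
{
    // Busy wait briefly in case more work arrives almost immediately
    for (int i = 0; i < 100; i++)
    {
        if (workAvailable)
            return;
        std::this_thread::yield();
    }

    // Give up and sleep properly until the engine signals more work
    std::unique_lock<std::mutex> guard(queueLock);
    workSignal.wait(guard, [] { return workAvailable || !engineRunning; });
}
```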

Of course the actual implementation is more complex but that’s the general idea.

So how does it perform?

My benchmark for this has been a session with 64 plugin instances (an object graph of just under 500 objects).

Before these improvements each thread was taking about 70–80 locks per audio cycle — way too many. After these changes it’s a much lower 7–10 for the primary audio thread and maybe 10–15 for each of the other five threads. It tends to total around 60 per audio cycle, which tracks nicely with the number of plugins.

In practice this translates to a drop in load from about 12% to 3% when idle, and from about 40% down to 25% when playing. Also, the load seems more stable. Not bad!

Available now — if you’re keen!

These changes are all available now in build 3062 — but, as you can see, this was not a trivial change and I’m still testing it. If you do find any issues, please let me know.