Cantabile Development Blog

Follow the development of music software Cantabile

Articles | Full Index | RSS Feed


The Downside of Lock-Free Programming

This is a long winded and technical post about a particularly nasty bug I've been chasing for over two days now. Mostly I've written this out for future reference for myself, but it might also be interesting to others.

For the last few weeks I've noticed that Cantabile's internal diagnostic checks were detecting something wrong after running Cantabile for a random amount of time. I'd been putting off fixing it while I work on the new UI, but decided I couldn't ignore it any longer. The reported error was a corruption in Cantabile's management of audio buffers - the audio buffer pool. The cause turned out to be a trivially simple logic error in a lock-free stack algorithm.

Typically when writing multi-threaded code, programmers use locks to ensure that only one thread is read or writing a particular piece of data at a time. This works well except that taking and releasing locks can be slow - particularly when two threads take the lock at the same time.

Lock-free programming is a technique for writing thread-safe code that doesn't use traditional locking mechanisms. Instead algorithms are designed to ensure that data structures can be manipulated in a safe way - usually with certain restrictions - without using these locks. The big advantage of lock-free algorithms is they are extremely efficient.

The downside is the difficulty in writing and debugging them. As mentioned, I've spent two days trying to nail down this one fault and it turned out to be a minor mistake introduced while porting Cantabile to x64. Although diagnostic code in Cantabile picked up that something was wrong this was not an easy fault to find:

  • It occurred completely randomly - sometimes almost immediately on starting Cantabile, sometimes after running for over 10 minutes.
  • Adding more diagnostic code changed the timing enough to effectively eliminate the fault - or at least make it rare enough to hardly ever see it.
  • Unit testing didn't pick up the fault. Because I suspected there was a fault in the audio buffer pool, I wrote a set of unit tests that stress tested that piece of code under very heavy load. Although it tested every situation I could think of it simply missed this fault completely.
  • Since the unit testing indicated there were no faults in the buffer pool, I figured in must be a logic error in the code the uses the buffer pool. I've therefore missed the fault and gone on spending many hours testing Cantabile's audio mixing, master bus, recorders and various other pieces of code.
  • Eventually, and just through luck, I managed to insert the appropriate set of validation code that proved the fault must be somewhere in the buffer pool.
  • After replacing the lock-free stack algorithm with the one from Cantabile 1.2 suddenly the fault went away. Yet, even closely comparing both versions of the code it was still not obvious why one worked and one didn't.

Finally it clicked... (technical explanation follows)

In porting Cantabile to x64 I needed to make changes to support 64-bit lock free algorithms. Due to the x64 compiler not supporting assembler I decided to switch to intrinsic functions supplied by the compiler, replacing the assembler routines from 1.2.

Here's the original working code from 1.2:

CAS was my original interlocked compare and exchange function which returns true if the exchange was successful.

Here's the new version, containing the fault:

_InterlockedCompareExchangePointer performs the same function as CAS except it returns the final value of the destination, not a flag indicating if the exchange was successful. To check if the exchange occurred you simply compare the return value to the comparand parameter. Looks identical right?

Wrong. The problem is the fact that the value of "pEntry->m_pNext" can be changed by other threads after calling InterlockedCompareExchange but before the comparison is made. The fix:

And herein lies the difficulty with lock-free programming. In theory the above two versions of this function should behave identically, but in practice, the other threads complicate things. With typical multi-threaded programming, the whole code segment would be protected with a lock and the problem wouldn't occur. With lock-free you need to be so careful.

Having said all that, I'm still a big fan of lock-free algorithms. Their efficiency makes it all worth while... I think.

Posted on October 6, 2007

Share This

Posted on October 6, 2007

www.GuruOne.biz says:

Just a reminder ;)

GOMPy is KING of midi manipulation

Cheers

Rich

Leave A Comment

All comments will be reviewed for spam before being displayed.