Articles | Full Index | RSS Feed
This is a long winded and technical post about a particularly nasty bug I've been chasing for over two days now. Mostly I've written this out for future reference for myself, but it might also be interesting to others.
For the last few weeks I've noticed that Cantabile's internal diagnostic checks were detecting something wrong after running Cantabile for a random amount of time. I'd been putting off fixing it while I work on the new UI, but decided I couldn't ignore it any longer. The reported error was a corruption in Cantabile's management of audio buffers - the audio buffer pool. The cause turned out to be a trivially simple logic error in a lock-free stack algorithm.
Typically when writing multi-threaded code, programmers use locks to ensure that only one thread is read or writing a particular piece of data at a time. This works well except that taking and releasing locks can be slow - particularly when two threads take the lock at the same time.
Lock-free programming is a technique for writing thread-safe code that doesn't use traditional locking mechanisms. Instead algorithms are designed to ensure that data structures can be manipulated in a safe way - usually with certain restrictions - without using these locks. The big advantage of lock-free algorithms is they are extremely efficient.
The downside is the difficulty in writing and debugging them. As mentioned, I've spent two days trying to nail down this one fault and it turned out to be a minor mistake introduced while porting Cantabile to x64. Although diagnostic code in Cantabile picked up that something was wrong this was not an easy fault to find:
Finally it clicked... (technical explanation follows)
In porting Cantabile to x64 I needed to make changes to support 64-bit lock free algorithms. Due to the x64 compiler not supporting assembler I decided to switch to intrinsic functions supplied by the compiler, replacing the assembler routines from 1.2.
Here's the original working code from 1.2:
CAS was my original interlocked compare and exchange function which returns true if the exchange was successful.
Here's the new version, containing the fault:
_InterlockedCompareExchangePointer performs the same function as CAS except it returns the final value of the destination, not a flag indicating if the exchange was successful. To check if the exchange occurred you simply compare the return value to the comparand parameter. Looks identical right?
Wrong. The problem is the fact that the value of "pEntry->m_pNext" can be changed by other threads after calling InterlockedCompareExchange but before the comparison is made. The fix:
And herein lies the difficulty with lock-free programming. In theory the above two versions of this function should behave identically, but in practice, the other threads complicate things. With typical multi-threaded programming, the whole code segment would be protected with a lock and the problem wouldn't occur. With lock-free you need to be so careful.
Having said all that, I'm still a big fan of lock-free algorithms. Their efficiency makes it all worth while... I think.
Posted on October 6, 2007
www.GuruOne.biz says:
Just a reminder ;)
GOMPy is KING of midi manipulation
Cheers
Rich