Mutexes: can’t live with them, can’t do without them

This post is about my love-hate relationship with mutexes and my approach to managing their evils. First, we start with the assumption that your application requires multiple threads. Implied here is that you have done a risk-benefit analysis showing that using threads/RTOS is a net positive for your project. You didn’t go down the RTOS path just because that’s what we did yesterday or because RTOSes are cool. Second, it is a given that you have to share data between threads. To share data safely (i.e. ensure data integrity and consistency) there are two basic options:

Mutexes.
Inter-Thread-Communication (ITC), e.g. message passing.

IME a multithreaded application uses both mutexes and ITC for sharing data safely. Mutexes are great in that they provide a simple mental model for sharing data between threads: lock the mutex, access the data, release the mutex. What could go wrong? Well…

I am assuming here that the reader has a basic understanding (or followed the links) of deadlock and priority inversion. I am skipping these introductory details because these topics are whole blog postings in themselves.

More on priority inversion. Most RTOSes provide a solution for this by implementing some variant of Priority Inheritance, where a lower priority thread that has acquired a mutex has its priority temporarily raised so it can execute, thereby not indefinitely blocking higher priority threads that are waiting to acquire the mutex. While this prevents the application from locking up, it can still violate the performance requirements of the higher priority threads, because they can be blocked for relatively long periods of time while waiting for a lower priority thread (now executing at a higher priority) to release the mutex.

Another complicating factor contributing to why mutexes are bad is that when deadlock or priority inversion occurs, it is almost always the result of some edge case or race condition that is difficult to reproduce on demand. If the mutex error(s) occurred on the happy path, you would have found and fixed them already as part of your unit testing 😉.

To Use Mutexes Safely:

My recommendations for avoiding deadlock and priority inversion when using mutexes are:

While holding a mutex, do not directly or indirectly acquire another mutex.
Keep the duration that the mutex is locked well bounded and deterministic. For example, avoid things like callbacks into the application or long IO operations while a mutex is locked.
Do the bare minimum work while the mutex is locked. This is somewhat analogous to the strategy of keeping your interrupt service routines (ISRs) as short as possible.
Mutexes 101: always have an unlock operation for every lock operation. In theory, unit testing and code reviews should catch this, but IME reality and theory are not always the same thing, i.e. this type of error still gets into production code.
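As a rough illustration of these guidelines, here is a minimal C++ sketch. It assumes a standard library with `std::mutex` is available (many RTOSes ship their own mutex API instead), and the `Reading`, `SharedReading`, and `formatReading` names are hypothetical, not from any real library. The lock is released automatically on every exit path, only a small fixed-size copy happens while it is held, and the slower formatting work runs on the copy.

```cpp
#include <cassert>
#include <mutex>
#include <string>

// Hypothetical fixed-size data shared between threads.
struct Reading {
    int temperatureC;
    int pressureKpa;
};

// Hypothetical module that owns the shared data and its mutex.
class SharedReading {
public:
    // Called by the producer thread.
    void set(const Reading& newReading) {
        std::lock_guard<std::mutex> lock(m_mutex);  // unlocked automatically on every exit path
        m_reading = newReading;                     // bare minimum work: one small copy
    }

    // Called by consumer threads. The mutex is held only for the copy.
    Reading get() const {
        std::lock_guard<std::mutex> lock(m_mutex);
        return m_reading;
    }

private:
    mutable std::mutex m_mutex;
    Reading            m_reading{0, 0};
};

// The slower work (string formatting here, but think logging or IO)
// operates on the local copy, with no mutex held.
std::string formatReading(const SharedReading& shared) {
    const Reading copy = shared.get();
    return "T=" + std::to_string(copy.temperatureC) +
           " P=" + std::to_string(copy.pressureKpa);
}

// Example usage (single-threaded, just to show the call pattern).
void exampleSharedReading() {
    SharedReading shared;
    shared.set({21, 101});
    assert(formatReading(shared) == "T=21 P=101");
}
```

The scoped lock is what enforces the "unlock for every lock" rule here; the other guidelines still depend on keeping the locked region this small as the code evolves.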

Where Does it Go Wrong?

The guidelines are simple enough. So how do I get in trouble with mutexes? I’ll start a module using a mutex because the original logic is simple and meets the above guidelines. And because the usage is simple, I skip doing any analysis or what-ifs about future mods to the module. Then the module is refactored, and a function that has acquired a mutex calls another function, which then calls another function, which then violates one of the above guidelines. I have earned this merit badge more than once, and not just early in my career. The allure of mutexes is due to their simplicity of use. It is way too easy to ignore the nuances, details, and baggage that come along with mutexes.
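To make the refactoring hazard concrete, here is a small C++ sketch. The `EventCounter` class and its method names are hypothetical, and the public/private split it shows is just one common convention for guarding against the problem, not something this post prescribes. The comment in `incrementTwice()` describes the tempting refactor that would indirectly re-acquire the mutex.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical module. Public methods lock; private *Locked() helpers never do.
class EventCounter {
public:
    void increment() {
        std::lock_guard<std::mutex> lock(m_mutex);
        incrementLocked();
    }

    // Added in a later refactor. The tempting version takes the lock and then
    // calls increment() twice. That re-acquires m_mutex on the same thread,
    // which is undefined behavior for std::mutex (in practice usually a deadlock).
    // Calling the non-locking helper avoids the nested acquire.
    void incrementTwice() {
        std::lock_guard<std::mutex> lock(m_mutex);
        incrementLocked();
        incrementLocked();
    }

    unsigned count() const {
        std::lock_guard<std::mutex> lock(m_mutex);
        return m_count;
    }

private:
    // Precondition: the caller already holds m_mutex.
    void incrementLocked() { ++m_count; }

    mutable std::mutex m_mutex;
    unsigned           m_count = 0;
};

// Example usage (single-threaded, just to show the call pattern).
void exampleEventCounter() {
    EventCounter counter;
    counter.increment();
    counter.incrementTwice();
    assert(counter.count() == 3);
}
```

The convention only helps if it is followed on every refactor; nothing in the language stops a `*Locked()` helper from later growing a call that takes a lock.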

After having suffered through too many Mutex issues, I now adopt the following policy:

The default design of a sub-system, module, class, etc. is a single threaded model. IME large portions of a multi-threaded application do not need to be thread aware or thread safe. If you don’t have a mutex, you can’t have a mutex error. Don’t burden a design with thread safety if it is not absolutely needed. For example, in my CPL C++ Class library I have a set of container classes (Lists, Dictionaries, etc.). Almost all of these classes are not thread safe. That is because if/when thread safety is needed, it is provided by the sub-system/module/class that uses the containers. In addition, when thread safety is needed, it is not always provided by a mutex, e.g. the design may use ITC message passing instead.
When thread safety is needed:
- Minimize the amount of code that requires thread safety.
- Do not immediately resort to a Mutex. Do your due diligence on the best approach to thread safety. This means performing the analysis/thought-experiment of how future changes may potentially violate the guidelines above.
Only use mutexes when their use can be fully encapsulated inside a module or class. For example, if your design has a function that locks a mutex and then calls a callback function (outside of the module) before releasing the mutex, you have introduced a latent bug and it is just a matter of time before your users find it. This is because at some point that callback function is going to violate one of the guidelines above.
Pay extra attention when refactoring a sub-system/module/class that uses mutexes. You need to perform the due diligence each time you refactor to ensure that you haven’t violated one of the guidelines.
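As a sketch of what "fully encapsulated" can look like for the callback case above, the following C++ example keeps the mutex private to the module and invokes the application callback only after the mutex has been released. The `Setpoint`, `Recorder`, and `recordChange` names are hypothetical, and the function-pointer-plus-context callback style is an assumption chosen to keep the locked region small and fixed-size.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical module: the mutex never escapes the class.
class Setpoint {
public:
    using Observer = void (*)(int newValue, void* context);

    void setObserver(Observer observer, void* context) {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_observer = observer;
        m_context  = context;
    }

    void write(int newValue) {
        Observer observer = nullptr;
        void*    context  = nullptr;
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_value  = newValue;
            observer = m_observer;  // copy what the callback needs while locked
            context  = m_context;
        }  // mutex released here
        if (observer != nullptr) {
            observer(newValue, context);  // application code runs with no mutex held
        }
    }

    int read() const {
        std::lock_guard<std::mutex> lock(m_mutex);
        return m_value;
    }

private:
    mutable std::mutex m_mutex;
    int                m_value    = 0;
    Observer           m_observer = nullptr;
    void*              m_context  = nullptr;
};

// Hypothetical application callback that calls back into the module.
// This would deadlock if write() invoked it while still holding the mutex.
struct Recorder {
    Setpoint* source;
    int       lastRead;
};

void recordChange(int /*newValue*/, void* context) {
    Recorder* recorder = static_cast<Recorder*>(context);
    recorder->lastRead = recorder->source->read();
}

// Example usage (single-threaded, just to show the call pattern).
void exampleSetpoint() {
    Setpoint setpoint;
    Recorder recorder{&setpoint, -1};
    setpoint.setObserver(recordChange, &recorder);
    setpoint.write(42);
    assert(recorder.lastRead == 42);
}
```

One trade-off of this approach: because the callback runs after the unlock, another thread may write a newer value before the callback executes, so the callback should not assume the value it is handed is still current.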

Summary

Mutexes are bad, but we need them. Practice safe mutexing 😉.

Patterns in the Machine