Fault Tolerance Through Supervisor Hierarchies

The "let it crash" approach to fault handling, implemented by linking actors, is very different from what Java and most other non-concurrency-oriented languages and frameworks have adopted. It is a way of dealing with failure that is designed for concurrent and distributed systems.

Let's look at concurrency first. Throwing an exception in concurrent code (assuming we are using non-linked actors) will simply blow up the thread that is currently executing the actor:

  1. There is no way to find out that things went wrong (apart from seeing a stack trace).
  2. There is nothing you can do about it.

Linked actors, in contrast, provide a clean way of being notified of the error and doing something about it.

Linking actors also allows you to create sets of actors where you can be sure that either:
  1. All are dead
  2. None are dead

This is very useful when you have thousands of concurrent actors. Some actors might have implicit dependencies and together implement a service, a computation, a user session, etc.
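The "all dead or none dead" guarantee can be sketched in plain Java. This is an illustrative simulation of the linking semantics only, not the Akka API; the `LinkedSet` class and its methods are made up for the example:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a set of linked actors: when one member crashes, the exit
// signal propagates to every other member, so the set fails as a unit.
class LinkedSet {
    private final List<String> members = new ArrayList<>();
    private final List<String> alive = new ArrayList<>();

    // Linking an actor makes it a member of the fail-together set.
    void link(String actor) {
        members.add(actor);
        alive.add(actor);
    }

    // One member crashing takes the whole linked set down.
    void crash(String actor) {
        if (members.contains(actor)) alive.clear();
    }

    List<String> aliveMembers() {
        return alive;
    }
}
```

Either all members are alive or none are; there is no in-between state where half a service keeps running against a dead dependency.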

It encourages non-defensive programming. Don't try to prevent things from going wrong, because they will, whether you want them to or not. Instead, expect failure as a natural state in the life-cycle of your app: crash early and let someone else (who sees the whole picture) deal with it.

Now let's look at distributed actors. You can't build a fault-tolerant system with just one single box; you need at least two. You also (usually) need to know if one box is down and/or if the service you are talking to on the other box is down. Here actor supervision/linking is a critical tool, not only for monitoring the health of remote services, but for actually managing the service and doing something about the problem if the actor or node is down, such as restarting it on the same node or on another node.

In short, it is a very different way of thinking, but one that is very useful (if not critical) for building fault-tolerant, highly concurrent, and distributed applications. It is just as valid whether you are writing apps for the JVM or for the Erlang VM (the origin of the "let-it-crash" and actor-supervisor ideas).


Supervisor hierarchies originate from Erlang’s OTP framework.

A supervisor is responsible for starting, stopping, and monitoring its child processes. The basic idea is that the supervisor should keep its child processes alive by restarting them when necessary. This makes for a completely different view on how to write fault-tolerant servers. Instead of trying everything possible to prevent an error from happening, this approach treats errors as something natural that will happen, and embraces them rather than trying to prevent them. Just 'Let It Crash™', since the components will be reset to a stable state and restarted upon failure.

Akka has two different restart strategies: All-For-One and One-For-One. They are best explained with pictures:


The OneForOne fault handler will restart only the component that has crashed.
[image: sup4.gif]


The AllForOne fault handler will restart all the components that the supervisor is managing, including the one that has crashed. This strategy should be used when you have a set of components that are coupled in such a way that if one crashes, they all need to be reset to a stable state before continuing.
[image: sup5.gif]
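The difference between the two strategies boils down to which components get restarted. Here is a minimal plain-Java sketch of that decision (purely illustrative, not the Akka API; components are identified by index):

```java
import java.util.ArrayList;
import java.util.List;

class RestartStrategies {
    // One-For-One: only the crashed component is restarted.
    static List<Integer> oneForOne(int size, int crashed) {
        List<Integer> toRestart = new ArrayList<>();
        toRestart.add(crashed);
        return toRestart;
    }

    // All-For-One: every component the supervisor manages is restarted,
    // including the one that crashed.
    static List<Integer> allForOne(int size, int crashed) {
        List<Integer> toRestart = new ArrayList<>();
        for (int i = 0; i < size; i++) toRestart.add(i);
        return toRestart;
    }
}
```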

Restart callbacks

There are two different callbacks that an Active Object or Actor can hook into:
  • Pre restart
  • Post restart

These are called prior to and after a restart upon failure, and can be used to clean up and reset/reinitialize state. This is important in order to recover from the failure and leave the component in a fresh and stable state before it consumes further messages.
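The restart sequence can be sketched in plain Java as follows. This is a conceptual illustration of the ordering only, not the Akka API; the `RestartableComponent` class is made up for the example:

```java
import java.util.ArrayList;
import java.util.List;

class RestartableComponent {
    final List<String> events = new ArrayList<>(); // records the callback order

    // Called on the failed instance: clean up possibly corrupt state.
    void preRestart(Throwable reason) {
        events.add("preRestart:" + reason.getMessage());
    }

    // Called on the fresh instance: re-initialize into a stable state.
    void postRestart(Throwable reason) {
        events.add("postRestart:" + reason.getMessage());
    }

    // What a supervisor conceptually does when the component crashes:
    // pre-restart cleanup, re-creation, post-restart initialization.
    void restart(Throwable reason) {
        preRestart(reason);
        // ... the supervisor re-creates the component's state here ...
        postRestart(reason);
    }
}
```

Only after `postRestart` has completed does the component resume consuming messages.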

Defining a supervisor's restart strategy

Both the Active Object supervisor configuration and the Actor supervisor configuration take a ‘RestartStrategy’ instance, which defines the fault management. The different strategies are:
  • AllForOne
  • OneForOne
These have the semantics outlined in the section above.

Here is an example of how to define a restart strategy:
  RestartStrategy(
    AllForOne(), // restart policy (AllForOne or OneForOne)
    3,           // maximum number of restart retries
    5000)        // within time in millis
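The two numeric parameters work together as a sliding window: restarts are permitted only as long as fewer than the maximum number of retries have happened within the time range. Here is a hypothetical plain-Java sketch of that bookkeeping (not the Akka implementation; the `RestartWindow` class is invented for illustration):

```java
import java.util.ArrayDeque;
import java.util.Deque;

class RestartWindow {
    private final int maxRetries;
    private final long windowMillis;
    private final Deque<Long> restartTimes = new ArrayDeque<>();

    RestartWindow(int maxRetries, long windowMillis) {
        this.maxRetries = maxRetries;
        this.windowMillis = windowMillis;
    }

    // Returns true if a restart at time 'now' is permitted; false means
    // the limit was exceeded and the supervisor gives up on the component.
    boolean tryRestart(long now) {
        // Drop restarts that have fallen out of the sliding time window.
        while (!restartTimes.isEmpty() && now - restartTimes.peekFirst() > windowMillis)
            restartTimes.removeFirst();
        if (restartTimes.size() >= maxRetries) return false;
        restartTimes.addLast(now);
        return true;
    }
}
```

This is what keeps a component that crashes in a tight loop from being restarted forever.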

Defining actor life-cycle

The other common configuration element is ‘LifeCycle’, which defines how the supervised actor is treated upon failure. The supervised actor can define one of two different life-cycle configurations:
  • Permanent: which means that the actor will always be restarted.
  • Temporary: which means that the actor will not be restarted, but it will be shut down through the regular shutdown process, so the 'shutdown' callback function will be called.

Here is an example of how to define the life-cycle:
  LifeCycle(
    Permanent()) // 'Permanent()' means that the component will always be restarted;
                 // 'Temporary()' means that it will not be restarted, but it will be shut
                 // down through the regular shutdown process so the 'shutdown' hook will be called

Supervising Actors

Declarative supervisor configuration

The Actor’s supervision can be declaratively defined by creating a ‘SupervisorFactory’. Here is an example:
object factory extends SupervisorFactory(
  SupervisorConfig(
    RestartStrategy(AllForOne, 3, 1000, List(classOf[Exception])),
    Supervise(
      new MyActor1,
      LifeCycle(Permanent)) ::
    Supervise(
      new MyActor2,
      LifeCycle(Permanent)) ::
    Nil))

val supervisor = factory.newInstance
supervisor.start // will link and start up all actors

Declaratively define actors as remote services

You can declaratively define an actor to be available as a remote actor using the 'RemoteAddress(hostname: String, port: Int)' config element.

Here is an example:
object factory extends SupervisorFactory(
  SupervisorConfig(
    RestartStrategy(AllForOne, 3, 1000, List(classOf[Exception])),
    Supervise(
      new MyActor1,
      LifeCycle(Permanent),
      RemoteAddress("localhost", 9999)) ::
    Nil))

Programmatic linking and supervision of Actors

Actors can, at runtime, create, spawn, link, and supervise other actors. Linking is done using one of the link methods available on the Actor itself.

Here is the API:
// link and unlink actors
link(actor)
unlink(actor)

// starts and links Actors atomically
startLink(actor)

// spawns (creates and starts) actors
spawn(classOf[MyActor])

// spawns and links Actors atomically
spawnLink(classOf[MyActor])

The supervising actor's side of things

If a linked Actor fails by throwing an exception, then an ‘Exit(deadActor, cause)’ message will be sent to the parent (you should, however, never try to catch this message in your own message handler; it is managed by the runtime).

If the parent wants to be able to do something with this message, e.g. to restart the linked Actor according to one of the predefined fault handling schemes, then it has to set its ‘trapExit’ flag to a list of the exceptions that it wants to be able to trap:
class MyActor extends Actor {
  trapExit = List(classOf[MyException], classOf[IOException])
  ... // implementation omitted
}

The supervising Actor also needs to define a fault handler that specifies the restart strategy the Actor should apply when it traps an ‘Exit’ message. This is done by setting the ‘faultHandler’ field:
protected var faultHandler: Option[FaultHandlingStrategy] = None

The different options are:
  • AllForOneStrategy(maxNrOfRetries, withinTimeRange)
  • OneForOneStrategy(maxNrOfRetries, withinTimeRange)

Here is an example:
faultHandler = Some(AllForOneStrategy(3, 1000))
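Conceptually, when the supervisor receives an ‘Exit’ message it checks the cause against its ‘trapExit’ list before applying the fault handler. A hypothetical plain-Java sketch of that check (illustrative only, not the Akka API):

```java
class ExitTrap {
    private final Class<?>[] trapExit;

    // trapExit lists the exception classes the supervisor is willing to handle.
    ExitTrap(Class<?>... trapExit) {
        this.trapExit = trapExit;
    }

    // Returns "restart" if the failure cause matches a trapped exception
    // class, "escalate" otherwise (the failure propagates further up).
    String onExit(Throwable cause) {
        for (Class<?> trapped : trapExit)
            if (trapped.isInstance(cause)) return "restart";
        return "escalate";
    }
}
```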

The supervised actor's side of things

The supervised actor needs to define a life-cycle. This is done by setting the lifeCycle field as follows:
lifeCycle = Some(LifeCycle(Permanent)) // Permanent or Temporary

In the supervised Actor you can override the ‘preRestart’ and ‘postRestart’ callback methods to hook into the restart process. These methods take, as an argument, the reason for the failure, i.e. the exception that caused the termination and restart of the actor. It is in these methods that you add code to clean up before termination and to initialize after restart. Here is an example:
override def preRestart(reason: Throwable) = {
  ... // clean up before restart
}

override def postRestart(reason: Throwable) = {
  ... // reinit stable state after restart
}

Supervising Active Objects

Declarative supervisor configuration

To configure Active Objects for supervision you use the ‘ActiveObjectManager’ and its ‘configure’ method. This method takes a ‘RestartStrategy’ and an array of ‘Component’ definitions, defining the Active Objects and their ‘LifeCycle’. Finally you call the ‘supervise’ method to start everything up. The Java configuration elements reside in the ‘se.scalablesolutions.akka.config.JavaConfig’ class and need to be imported statically.

Here is an example:
import static se.scalablesolutions.akka.config.JavaConfig.*;

private ActiveObjectManager manager = new ActiveObjectManager();

manager.configure(
  new RestartStrategy(new AllForOne(), 3, 1000, new Class[]{Exception.class}),
  new Component[] {
    new Component(
      Foo.class,                      // Foo and Bar are example Active Object classes
      new LifeCycle(new Permanent()),
      1000),                          // timeout in millis
    new Component(
      Bar.class,
      new LifeCycle(new Permanent()),
      1000)
  });
manager.supervise();

Then you can retrieve the Active Object as follows:
Foo foo = manager.getInstance(Foo.class);

Restart callbacks

For Active Objects, callbacks can be defined in two different ways. The first is to use annotations. You can annotate a no-argument method with void as the return type with:
  • @se.scalablesolutions.akka.annotation.prerestart
  • @se.scalablesolutions.akka.annotation.postrestart

The methods can be arbitrarily named. For example:
@prerestart
public void preRestart() {
  ... // clean up before restart
}

@postrestart
public void postRestart() {
  ... // reinit stable state after restart
}

These methods will then be invoked upon restart.

The second one is to define the names of these callback methods in the declarative supervisor configuration:
new Component(
  new LifeCycle(new Permanent()),
  new RestartCallbacks("preRestart", "postRestart"))