On Error Handling

❝On error handling - a general outline on case and error handling❞

A living document on the fundamentals of case and error handling. A general, programming-language-agnostic guideline for application development of all sorts. The article attempts to establish first principles that can be applied in any context, with defined trade-offs that give you the adaptability needed to make them suitable to any situation.

Note In this document I use “case” and “error” interchangeably. Anything that’s not on the expected happy path is typically an alternative “case” and if this happens to be undesirable we call it an “error”.


Overview

We recognize two properties of errors: whether an error is preventable, and whether it is acceptable. Based on these two properties, cases are classified into one of three types:

  1. Bug (preventable, unacceptable/acceptable): a programming mistake that becomes apparent through the unexpected situation you end up in. Either an illegal situation, which fires an exception, or a wrong result: execution completes successfully but the result is unreliable.
    • To be eradicated
    • Occurrence: either because of bugs inside the implementation logic (inside the function), or because of bad use by the using logic (outside the function).
    • May be identified early using assertions.
    • Should surface as an unexpected, i.e. unchecked, exception.
    • Failing immediately helps to identify that a critical, preventable error has occurred.
  2. Case/Error (unpreventable, acceptable): an alternative case. Typically not the happy path, as that is supported implicitly. It might be considered an error case or an alternative control-flow case. For example, a race condition in a communication protocol, where the software has to adapt to external events.
    • To be supported
    • Occurrence: unpredictable event under accepted circumstances.
    • Must be taken into account in the code, as it is a reasonable, to-be-expected case.
  3. Failure (unpreventable, unacceptable): Typically things such as OS errors, memory unavailable, JVM malfunctioning, and all other exotic types of errors that you don’t want to take into account in your application.
    • To be surrendered to
    • Occurrence: unpredictable event that undermines accepted circumstances and that cannot be handled.
    • Cannot be prevented. Cannot (or should not) be mitigated as there is risk of inconsistent application state or otherwise resulting in circumstances that are not desirable.
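As a minimal sketch of the three types in Python (the `parse_port` helper and its names are illustrative, not from the article): a bug is caught by an assertion, a case/error is signalled with a documented exception, and a failure is simply not handled.

```python
def parse_port(text):
    """Parse a TCP port number from user-supplied text."""
    # Bug (preventable): calling this with a non-string is a programming
    # mistake in the using logic; the assertion identifies it early.
    assert isinstance(text, str), "parse_port expects a string"

    # Failure (unpreventable, unacceptable): e.g. a MemoryError raised by
    # the run-time below is deliberately NOT caught; we surrender to it.

    # Case/error (unpreventable, acceptable): malformed input is a
    # to-be-expected alternative case, signalled with a documented exception.
    if not text.strip().isdigit():
        raise ValueError(f"not a port number: {text!r}")
    port = int(text)
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return port
```

`parse_port("8080")` returns the number, while `parse_port("abc")` raises the documented `ValueError` that callers are expected to handle.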

The classification is only strict insofar as it works for the project. It is possible to shift the boundaries between the classes; see the section on trade-offs for the consequences.


The underlying assumptions that are fundamental to the spirit of this article:

  1. Programming errors should be eradicated as much as possible. Or, said differently, we aim for a program that is “as flawless as possible”.
  2. There exists an upper bound to the types of failures we can eradicate. There is a limited amount of effort we can put in. Therefore we define a scope. Invariably, some types of failures may be out-of-scope.
  3. Not all applications are the same. Not all applications require the same trade-offs and boundaries.


When a case/error occurs, we need to handle it. Handling is primarily concerned with the current thread, i.e. the control flow in which the case or error occurs. We handle the case/error so that execution of the control flow can resume.

The fundamental complexity of handling errors is to identify:

  1. At what level in the control flow hierarchy to handle the error?

    • Too low in the call hierarchy:
      You cannot do anything as the information needed to successfully mitigate the error is not available. You do not have access to sufficient application state to accomplish something meaningful.
    • Too high in the call hierarchy:
      You can only do very broad actions that feel extreme/exaggerated for the particular error case. It works, but is far from ideal. For example, you experience a disconnect on the network. The only way left to mitigate is to throw away all user input and start from scratch.
  2. How to handle the error, such that the outcome is most beneficial?

    • Adapting the situation to the error case. This may fix the error completely. Then there is no need to do any logging.
    • Mitigating the error case by canceling other related activities, may not fix the problem, but at least prevents other problems from happening as a consequence, i.e. prevent cascading errors.
      You may want to propagate the error after having performed your own part of mitigation.
    • Logging will make the error insightful but does nothing to resolve the situation. (One will typically use INFO, WARNING or ERROR level logging.)

Note For logging we only consider levels INFO, WARNING and ERROR. Higher-granularity levels, such as DEBUG and TRACE, are used for different purposes.
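A sketch of choosing the level, under assumed names (`fetch_document`, `TransientNetworkError` are hypothetical): the low-level reader cannot fix a dropped connection, while the mid-level fetch routine holds both the retry policy and the state needed to resume, so the error is adapted to there.

```python
import time

class TransientNetworkError(Exception):
    """An unpreventable, acceptable case: the peer briefly dropped out."""

def fetch_document(read_chunk, retries=3):
    """Mid-level logic: high enough in the call hierarchy to know the
    retry policy, low enough to still hold the state needed to resume."""
    for attempt in range(retries):
        try:
            return read_chunk()
        except TransientNetworkError:
            # Adapt: wait briefly and retry. The error is fixed here,
            # so nothing propagates upward and no logging is needed.
            time.sleep(0)  # kept instant for the sketch; real code would back off
    # Out of retries: adaptation failed, propagate so that a higher
    # level can mitigate (e.g. discard partial work and inform the user).
    raise TransientNetworkError(f"gave up after {retries} attempts")
```

Handling this any higher would leave only the broad "start from scratch" option described above; handling it any lower, inside the reader itself, would lack the retry budget and partial state.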

Desirable side-effects

Apart from handling the error such that the control flow can resume execution, there may also be a need for “secondary” effects. One may want to log that a case/error happened, in addition to resuming execution. You may also want to signal other threads and/or processes to have them adapt to changed circumstances in the control flow of this thread.

Out-of-scope: DEBUG/TRACE levels

Log levels DEBUG and TRACE are out of scope for the chapter on error handling. These levels are used not to handle an error such that control flow can resume execution, but to leave breadcrumbs for the developers/users to find such that troubleshooting an unexpected situation can be simplified. At this point it becomes part of the desired control flow itself, and is no longer an exceptional situation.

DEBUG and TRACE level logging can be applied to any piece of logic. There is no requirement to restrict it specifically to case/error handling.

Trade-offs (deviations)

In general, trade-offs will give you flexibility at the cost of control, readability, predictability. Or, in the other direction, control at the cost of syntactic complexity, mental overhead, verboseness.

  1. Redefining case/error as a bug:
    No language/IDE support for expected cases. Mixes up programming errors and circumstantial/exceptional case handling. Reduced syntactic complexity, but at the cost of having to document and track these cases manually.

  2. Redefining case/error as a failure:
    Shifts the goal from a normal application towards a proof of concept: only the happy flow is supported, and any exceptional cases are considered out-of-scope. The application moves towards working only under predefined, perfectly matching circumstances.

  3. Redefining bug as case/error:
    Attempting to handle programming errors (possibly on the side using the logic) as alternative cases has a number of tricky consequences:

    • More complexity in the logic, e.g. more complicated method signatures as more is handled.
    • Mitigation of programming errors might be inexact and may result in unintended silencing of unexpected errors. Which in turn might corrupt application state.
    • Increase of mental overhead for complete and correct use of the method.
  4. Redefining failure as case/error:
    Choose to handle more exotic error cases, such that your application becomes more robust. This is traded off for additional complexity and mental overhead. This is a trade-off that is typically applied to frameworks.
    Frameworks are expected to be robust against failures caused by the logic they host. Logic that is hosted in a framework is often foreign, i.e. not developed by the same people, thus requiring the extra robustness as precautionary measure to guarantee robust operation of the framework.

Setting the default for unhandled cases/errors

Apart from the explicit handling of errors that occur, we may need to define how to handle unhandled errors. For example, following the convention described above, we would encounter all bug-type errors. If nothing is defined, languages and run-times typically dump the raw information on the console’s stderr interface. One may redefine this default handling, such that critical errors are redirected to, for example, a server log instead of the console. This makes the errors more easily discoverable and accessible.

Such redirecting of unhandled errors may not be available in every application or language run-time or virtual machine. If it is available, it enables you to optimize the way in which unhandled errors are exposed.


  1. Choose the most appropriate level in the call hierarchy to handle the case.
  2. Choose the best possible way to handle the case so as to resume control flow. Either (in order of preference):
    • Adapt state and control flow to fix the problem.
    • Mitigate the error to avoid cascading effects, if fixing is not possible. Rethrow if further mitigation is needed on other levels. Then resume executing the (adapted) control flow.
    • Log the problem, if nothing else can be done.
  3. Handle the case exactly once.
  4. Apply additional side-effects.
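The preference order above can be sketched as a single handler (the exception types and the `cancel_pending` helper are hypothetical stand-ins):

```python
import logging

def handle(work, cancel_pending):
    """Apply the preference order: adapt, else mitigate and propagate,
    and log only what cannot otherwise be resolved."""
    try:
        return work()
    except KeyError:
        # Adapt: a missing entry has a safe default. The problem is
        # fixed here, so nothing is logged and nothing propagates.
        return None
    except OSError:
        # Mitigate: cancel related activities to prevent cascading
        # errors, then rethrow so a higher level can finish mitigation.
        cancel_pending()
        raise
    except ValueError as err:
        # Nothing else can be done: log, and handle exactly once.
        logging.warning("unresolvable case: %s", err)
        return None
```

Note that each exception type is handled exactly once: the adapted and logged cases end here, while the mitigated one is explicitly passed on for the remaining part of its handling.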

Note “Silencing an error” is when you handle the error by doing nothing, i.e. perform no action and leave no trace of the error behind. Silencing an error should never be necessary. If it seems needed, log at a low level instead, to ensure that you always have some way of discovering the problem, or a similar one, if it occurs. By silencing, you run the risk of silencing more than just the intended error.
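The risk can be made concrete (the cache lookup is a made-up example): a broad catch silences every error, including bugs, whereas a narrow catch with a low-level log handles only the intended case and leaves a trace.

```python
import logging

cache = {}

def lookup_silenced(key):
    try:
        return cache[key].value  # bug: entries have no .value attribute
    except Exception:
        return None  # silenced: the AttributeError bug disappears too

def lookup_logged(key):
    try:
        return cache[key]
    except KeyError:
        # Narrow catch for the intended case only, with a trace left behind.
        logging.info("cache miss for %r", key)
        return None
```

With `cache["k"] = 1`, the silenced variant returns `None` even though the entry exists, because a programming error is being swallowed along with the intended miss; the logged variant returns the value and records only genuine misses.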

Architectural considerations

So far we have discussed handling cases and errors in the context of a single application or architectural boundary. Within the same architectural boundary you will find a single convention for handling errors, whether default or with trade-offs. However, across architectural boundaries one might encounter different concerns.

Orthogonal concerns: “fail fast” and “fault tolerance”

Let’s take the example mentioned earlier: an application framework that hosts an arbitrary number of “client” applications. One can identify an architectural boundary in between the application framework and the hosted client application, the client application being untrusted, foreign logic from the perspective of the application framework. The client application would want to follow the default error handling conventions, as described above. The application framework, on the other hand, will need to be more robust such that it can handle failures whose root cause lies within the client application. For example, the application framework would need to be robust against programming errors (i.e. type bug) of the client application. Therefore it would have a catch-all for bug-type errors to catch anything originating from within a single hosted client application. (This mechanism will be present for each hosted client application.) It will then, without distinguishing between specific errors, report them as client application malfunctions to the application framework administrator. There is no sense, however, in handling bug-type errors anywhere else in the application framework itself. As always, we would not want to hide programming errors that might exist within the application framework logic.
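A sketch of such a per-client catch-all (`run_hosted` and `report_malfunction` are hypothetical names; the latter stands in for notifying the framework administrator):

```python
import logging

def report_malfunction(app_name, err):
    # Stand-in for reporting to the application framework administrator.
    logging.error("client application %s malfunctioned: %r", app_name, err)

def run_hosted(app_name, app_main):
    """Framework boundary: a catch-all around the foreign client logic only.
    Bugs inside the framework itself are NOT caught here and still fail fast."""
    try:
        return app_main()
    except Exception as err:  # bug-type errors escaping the hosted client
        report_malfunction(app_name, err)
        return None  # the framework and its other clients keep running
```

The catch-all wraps exactly one architectural boundary: a division-by-zero bug in one client application is reported and contained, while the same bug in framework code would still surface as an unhandled error.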

The example above demonstrates orthogonal concerns. Each concern tackles a different problem. It is important to understand why these concerns are indeed orthogonal. “Failing fast” is a desirable property of an individual (implementation) component, for the simple reason that it is neither desirable nor beneficial to hide errors. “Fault tolerance” is a desirable property of the overall solution. Given the knowledge that mistakes will happen, we need to handle these failures gracefully. Both are desirable properties that cannot coexist in the same space, i.e. you would be required to fail and not fail at the same time. However, by separating the concerns using an architectural boundary, it is possible to allow for failure in individual components while mitigating failures in the overall solution.

There may be different concerns for each architectural layer, and there may be independent concerns when crossing architectural boundaries. For that reason, the error handling conventions may be different for each layer, i.e. we make different trade-offs.

Preparation for projects

In order to make the handling of errors consistent across a project, it is valuable to explicitly define how to handle the various types of errors. This is an architectural concern, as you expect to reach certain (non-)functional goals.

  1. Define the boundaries between the case-types (as defined in section Overview).
  2. Define the (non-)functional requirements (goals) for each architectural layer.

An application design can then concretely specify how to handle specific types of errors to adhere to the architectural requirements, if this is even necessary.

  1. Define how each type is expected to be handled, if this may not be obvious.
    • including any trade-offs.
  2. Define how unhandled errors are treated by the run-time/application framework, if not default.

Define the process of how to act when the various case-types are encountered.

Consequences of error handling strategies

For most of this article, we have looked at different kinds of errors, how they can be classified, and how to handle them. It is also important to consider the consequences of error handling strategies for the user.

  1. Ignoring or actively silencing errors.
    This results in undefined/undocumented/unpredictable behavior. Applications (silently) produce bad results, report bad information, run forever, or crash. The corresponding system may be left in an unpredictable state, as may any system it interacted with.
    Such applications cannot be trusted enough to even do the task they were designed to do. You cannot know whether or not your results can be trusted. Users will or should avoid these applications.
  2. Failing on bugs.
    This results in bad user experience, but ultimately can be resolved as the application knows its own boundaries. Its own system and systems it interacted with can be reset to a predictable state.
    This indicates that there is more work to do. Users can trust the results if any, and may find themselves a workaround for the time being.
  3. Failing on valid but unsupported cases.
    The user is informed that the application does not work for their input. The user can make an informed decision whether to search for a different solution and/or keep using the application for cases with working input.
  4. Failing for invalid cases.
    The user is informed of problems with their input. The fact that this is discovered gives the user insight into the thoroughness of the application and trust in its results. With sufficiently detailed information, the error message informs the user in what ways the input is bad, such that it can be corrected.
  5. No failures.
    The only errors the application may produce are serious malfunctions, which the user cannot correct anyway. The application is mature enough to assist the user with exceptional situations, either automatically or by instructing the user. The user can trust the application to work for them.

The cases above illustrate increasingly positive outcomes through (capable) error handling from the perspective of the user. Even if an application is part of a batch process, eventually a user is impacted by it. It is important to consider how errors impact the user.



This post is part of the Living documents series.