Skip to content

Improve Your Errors

August 27, 2017

One craft that I think goes undervalued among programmers is writing good error messages.

This is an especially important skill for engineers who build production systems, or libraries used by those systems. Usually the person who must decipher the meaning of the messages isn’t the author of the code (e.g. SREs or other on-call engineers). If the message is not well written, it can easily be misinterpreted, or even ignored. Also poorly composed errors waste the time of people who must figure out the actual meaning and importance.

Here’s the criteria you should consider when logging an error message:

  1. How do you explain the error?
  2. Is the error actionable?
  3. What is the appropriate logging level (ERROR, WARN, or INFO)?

Let’s consider these three points in more detail.

How do you explain the error? A good error message doesn’t explain what happened, it explains why it happened. Here’s an example I encountered recently while debugging a search query that returned an error response: the logs showed a NullPointerException thrown from code that looked up a scoring algorithm to rank documents matching the query. Being unfamiliar with this code, I found the engineer who wrote it and asked why it happened. Their answer was because the search service wasn’t configured to run the experimental scoring algorithm requested in the query. Ah, that makes sense, and my follow-up question was, then why is the error message a NullPointerException? They answered, because the reference to the scorer is null! WRONG! The author of the code knew exactly what condition caused the value to be null, and should’ve logged a message saying “Scoring algorithm foo requested in the query was not configured” or something to that effect.

A good rule of thumb is that given an error in the log message, you shouldn’t need to look at code to understand why the error happened.

Is the error actionable? There’s two reasons for determining if an error is actionable. The first is to offer a helpful suggestion for fixing the problem to the maintainer of the system. Using the previous example, maybe there’s a configuration file that people commonly forget to update. In this case, modifying the message to “Scoring algorithm foo requested in the query was not configured, please check scoring_experiments.xml.” When the error is encountered, now not only does the maintainer know why it happened, they know a possible action to fix it.

The second reason for determining if an error is actionable is knowing if it should even be an error at all. If there is no action someone can take to fix the error, then maybe logging it doesn’t add any value. The only purpose of reporting a non-actionable error is to create noise. And this brings us to our final point.

What is the appropriate logging level? When something goes wrong with a system, often the first step in debugging is scanning the logs for errors and warnings. It’s disheartening to discover there are tons of these messages, but none seem to be related to the current problem at hand. You start wondering, is our software so bad? How come nobody fixes all the reported problems? Is our reporting so poor that everyone has become desensitized it? And then you start ignoring errors yourself, because they don’t help you.

There’s some subjectivity in deciding which level a message belongs to, but too often there is little rhyme or reason. It helps to have some guidelines in place for choosing the appropriate level. The guidelines I like are:

  • ERROR The system should be able to run without logging any errors under normal circumstances. Errors should only happen because they are not expected. Anything logged as an error should be fixed immediately. Also, errors are usually things that require human intervention to fix correctly.
  • WARN Like errors, warnings should not happen under normal operation. They should describe something that is likely leading to other problems, like degraded operation (such as dropped queries). Unlike errors, they may not require any immediate human intervention to fix.
  • INFO Informational messages should describe something interesting in the system operation. It could be an event that was unexpected, but the system dealt with it by self-healing. It’s not a warning or error because the system handled it, but it is interesting, because it was an anomaly. There’s usually a lot of noise at the INFO level, because it’s not deemed critical people throw lots of extra logging there. But little of it is interesting unless care is taken to consider why this event requires a message. Generally, a log should have bunch of INFO messages at start-up that describe the states as the system transitions to running, but then once up, not much except for these “interesting” things.

In a distributed system, it’s normal for machines to disconnect. Often these kind of events are logged as errors, because disconnecting seems bad. But it’s probably only an error if the remote service is not redundant. In fact, for many systems this doesn’t even require a warning; a disconnect can be treated informational because the system can reach a replica server. Now if the number of active replicas dips below some threshold, then this might elevate to a warning, because you risk running into service degradation. And when all redundant servers are unreachable, this now qualifies as an error, because a human needs to look into what happened to those servers, network, etc.

When writing error messages, and other messages that go into system logs, it is good to keep in mind that people reading the messages need to understand why they happen. Explaining why instead of what, offering tips for fixing problems, and not wasting time by logging inappropriate or insignificant events goes a long way to improving the operability of your systems.

No comments yet

Leave a Reply

Your email address will not be published. Required fields are marked *