Skip to content

Improve Your Errors

August 27, 2017

One craft that I think goes undervalued among programmers is writing good error messages.

This is an especially important skill for engineers who build production systems, or libraries used by those systems. Usually the person who must decipher the meaning of the messages isn’t the author of the code (e.g. SREs or other on-call engineers). If the message is not well written, it can easily be misinterpreted, or even ignored. Also poorly composed errors waste the time of people who must figure out the actual meaning and importance.

Here’s the criteria you should consider when logging an error message:

  1. How do you explain the error?
  2. Is the error actionable?
  3. What is the appropriate logging level (ERROR, WARN, or INFO)?

Let’s consider these three points in more detail.

How do you explain the error? A good error message doesn’t explain what happened, it explains why it happened. Here’s an example I encountered recently while debugging a search query that returned an error response: the logs showed a NullPointerException thrown from code that looked up a scoring algorithm to rank documents matching the query. Being unfamiliar with this code, I found the engineer who wrote it and asked why it happened. Their answer was because the search service wasn’t configured to run the experimental scoring algorithm requested in the query. Ah, that makes sense, and my follow-up question was, then why is the error message a NullPointerException? They answered, because the reference to the scorer is null! WRONG! The author of the code knew exactly what condition caused the value to be null, and should’ve logged a message saying “Scoring algorithm foo requested in the query was not configured” or something to that effect.

A good rule of thumb is that given an error in the log message, you shouldn’t need to look at code to understand why the error happened.

Is the error actionable? There’s two reasons for determining if an error is actionable. The first is to offer a helpful suggestion for fixing the problem to the maintainer of the system. Using the previous example, maybe there’s a configuration file that people commonly forget to update. In this case, modifying the message to “Scoring algorithm foo requested in the query was not configured, please check scoring_experiments.xml.” When the error is encountered, now not only does the maintainer know why it happened, they know a possible action to fix it.

The second reason for determining if an error is actionable is knowing if it should even be an error at all. If there is no action someone can take to fix the error, then maybe logging it doesn’t add any value. The only purpose of reporting a non-actionable error is to create noise. And this brings us to our final point.

What is the appropriate logging level? When something goes wrong with a system, often the first step in debugging is scanning the logs for errors and warnings. It’s disheartening to discover there are tons of these messages, but none seem to be related to the current problem at hand. You start wondering, is our software so bad? How come nobody fixes all the reported problems? Is our reporting so poor that everyone has become desensitized it? And then you start ignoring errors yourself, because they don’t help you.

There’s some subjectivity in deciding which level a message belongs to, but too often there is little rhyme or reason. It helps to have some guidelines in place for choosing the appropriate level. The guidelines I like are:

  • ERROR The system should be able to run without logging any errors under normal circumstances. Errors should only happen because they are not expected. Anything logged as an error should be fixed immediately. Also, errors are usually things that require human intervention to fix correctly.
  • WARN Like errors, warnings should not happen under normal operation. They should describe something that is likely leading to other problems, like degraded operation (such as dropped queries). Unlike errors, they may not require any immediate human intervention to fix.
  • INFO Informational messages should describe something interesting in the system operation. It could be an event that was unexpected, but the system dealt with it by self-healing. It’s not a warning or error because the system handled it, but it is interesting, because it was an anomaly. There’s usually a lot of noise at the INFO level, because it’s not deemed critical people throw lots of extra logging there. But little of it is interesting unless care is taken to consider why this event requires a message. Generally, a log should have bunch of INFO messages at start-up that describe the states as the system transitions to running, but then once up, not much except for these “interesting” things.

In a distributed system, it’s normal for machines to disconnect. Often these kind of events are logged as errors, because disconnecting seems bad. But it’s probably only an error if the remote service is not redundant. In fact, for many systems this doesn’t even require a warning; a disconnect can be treated informational because the system can reach a replica server. Now if the number of active replicas dips below some threshold, then this might elevate to a warning, because you risk running into service degradation. And when all redundant servers are unreachable, this now qualifies as an error, because a human needs to look into what happened to those servers, network, etc.

When writing error messages, and other messages that go into system logs, it is good to keep in mind that people reading the messages need to understand why they happen. Explaining why instead of what, offering tips for fixing problems, and not wasting time by logging inappropriate or insignificant events goes a long way to improving the operability of your systems.

A Glossary of Game of Thrones Terms

January 6, 2015

Game of Thrones is a series of books written by George R. R. Martin. They are set in a fictional world filled with dragons, spells and walking dead. Nevertheless, to distinguish this work as fantasy, the author was compelled to use an artful lexicon of archaic words, which may confuse us modern speaking readers at first. If so, use this glossary of words you may encounter as a guide.

  • Eddard- a male name that may be abbreviated as “Ned” (just like “Edward”).
  • Milk of the poppy- an opiate-based drink to treat pain. Appears often and never abbreviated, as this world has not yet invented slang for their drugs. But if I were in Westeros, I’d surely call it “moppy.”
  • Nuncle – it means uncle, but with an inexplicable n in front.
  • Northron- northern, somehow spelled in an outdated Scottish way.
  • Pease- peas.
  • Robb- a male name, pronounced the same as “Rob.” God save us when Game of Thrones fans start naming their real children after these characters (see Eddard).
  • Ser- used to address a knight, but in the same way we might use the word “Sir.” In other words, misspelled.
  • Southron- see Northron.
  • Smallclothes- underwear. Actually an 18th century word for breeches, even though Westeros more resembles 11th century.
  • Strongwine- literally means “strong wine.”
  • Summerwine- literary tip: conjoin any adjective with wine to be swept away to a fantasy world.
  • Sweetcorn- a husked vegetable that grows in ears, with high sugar content.
  • Two and twenty- twenty-two. Ages are expressed least significant digit first because fuck knows why.

At first I thought most these words were made-up, but most can be found in references as archaic (and all from inconsistent centuries at that). However, there’s also a strong propensity to create words by simply joining two words. But despite this, it’s important to know that you can NEVER combine the word “break” with “fast.” In Game of Thrones, despite the number of joined words, breakfast is not a word. However “break your fast” appears a shitload.

I really like the stories told in Game of Thrones, but there’s nothing like encountering an archaic word to rouse you out of the fiction and back into “why am I reading this again?” I imagine the intent is to immerse the reader in the world, but it has the opposite effect on me–it’s not fooling me into thinking I’m reading a 16th century tome, it’s just annoying me.

Increase the Size of a Linux Disk in VirtualBox

March 26, 2014

After running out of space on a VirtualBox, I searched for some instructions on how to expand the size of the drive. I found a couple of good descriptions, such as this one, and this other one, both which are very similar.

Both of these instructions conclude with you happily finding the extra space when restarting your virtual machine. For me, this was a sadder experience. Once I rebooted the machine, it still showed the same amount of space. The reason was that the storage on my VM is managed by LVM, and although the physical space was expanded, the logical volume still needed to be extended in order for the system to use it.

This HOWTO reference explains how to extend a logical volume, and I used the following steps:

$ # Extend the volume to consume all of the free physical space
$ sudo lvextend -l +100%FREE /dev/mapper/vg_agent-lv_root
$ # Resize the filesystem to the logical volume size
$ sudo resize2fs /dev/mapper/vg_agent-lv_root

Here /dev/mapper/vg_agent-lv_root was the name of the root filesystem as reported by df. I was able to run these commands on the already mounted filesystem (they worked without requiring an unmount). After running the two commands, my root filesystem joyfully showed the correct full size.

Starbase Encoding Gotcha

February 24, 2014

There’s a library for Python called Stargate that makes it easy to interact with HBase over the REST server.

While trying this library, I discovered a potential gotcha if you’re expecting to be compatible with the Java library for HBase: if you insert any data that is not already a string via Starbase, then it will be converted to a string before it is written to HBase, whereas the Java library encodes the integers as raw bytes. It’s easy to work around, you just need to be aware of it.

For example, say I have an integer value stored in a python variable; to insert it to a column called “info:myInt” for row “fromStarbase” in Starbase looks something like:

my_int_value = 12345
t = c.table('testTable')
t.insert('fromStarbase', { 'info:MyInt':my_int_value })

As the Starbase library prepares this data for submission, it calls str() on the variable my_int_value, converting it to the string "12345".

Using the Java API, the same action looks something like:

int my_int_value = 12345;
Put p = new Put(Bytes.toBytes("fromJava"));
p.add(Bytes.toBytes("info"), Bytes.toBytes("MyInt"),
    Bytes.toBytes(my_int_value));
HTableInterface table = pool.getTable("testTable");
table.put(p);

This uses the function Bytes.toBytes(int) to encode the integer value. In this case, it does not covert it to a string, but rather to an array of bytes in big-endian order.

Note that either approach is incorrect, because HBase deals in sequences of bytes, and is agnostic as to how you choose to encode data. However, by convention, I suspect the Java approach is more widely used, and therefore somewhat expected.

On the other hand, the HBase shell also converts values entered in the shell to a string, so maybe people do expect it:

put 'testTable', 'fromShell', 'info:MyInt', 12345

Doing a scan of the table from the shell we can see the difference in Java from the other two:

hbase(main):016:0> scan 'testTable'
ROW            COLUMN+CELL
 fromJava      column=info:MyInt, ts=1393306182999, value=\x00\x0009
 fromShell     column=info:MyInt, ts=1393301137823, value=12345
 fromStarbase  column=info:MyInt, ts=1393301137823, value=12345
3 row(s) in 0.0230 seconds

Converting to string does have the benefit of making it human readable from the shell. But for most data I would prefer the Java convention. To make the Starbase and Java share the same encoding for integers, encode the value before inserting it:

def encode_int(intval):
    """Encode integer as a string in big-endian order."""
    chars = [ chr((intval >> i) & 0xff) for i in range(24, -1, -8) ]
    return ''.join(chars)
 
my_int_value = 12345
my_encoded_int_value = encode_int(my_int_value)
t = c.table('testTable')
t.insert('fromStarbaseEnc', { 'info:MyInt':my_encoded_int_value })

Now the value is encoded the same as the Java client:

hbase(main):017:0> scan 'testTable'
ROW               COLUMN+CELL
 fromJava         column=info:MyInt, ts=1393306182999, value=\x00\x0009
 fromShell        column=info:MyInt, ts=1393301137823, value=12345
 fromStarbase     column=info:MyInt, ts=1393301137823, value=12345
 fromStarbaseEnc  column=info:MyInt, ts=1393306195342, value=\x00\x0009
4 row(s) in 0.0250 seconds

The Java library uses similar binary encodings for longs, floats and doubles, so for completeness you need the analogous encoders in python.

Restoring the Chrome App Page as the Default.

November 25, 2013

Shortly after the previous post, which mentions that I use the App page as the default browser start page, Google rolled out a change to Chrome where a search page appears in its place instead. To reach the App page now requires selecting an item from a menu bar.

I don’t care for this extra level of indirection, but fortunately my coworker clued me in on how to restore the previous behavior. In the Chrome flag settings, found using the link chrome://flags/#enable-instant-extended-api, is an option to disable the feature called “Instant Extended API.” With this disabled, the App page returns as the default.

There’s many other flags to play with, so knowing about this page might be a good thing (or bad, depending on how many things I end up breaking by tweaking too much).