Logging
Why a mundane topic gets us excited
Logging is one of those things that someone is either passionate about, or they just don't care. There's usually not very much of a middle ground. For those who are passionate about logging, hopefully this post will give you some new ideas and start a conversation or two. For those who don't care, I'd ask you the questions:
- Do you know what's going on in your systems right now?
- Do you know who's logged in on your systems right now?
- Do you know when the last problem happened with your system, and did you get enough details to do something useful about the problem?
You've built your system, you've suffered through the months of pain to get it working and ready to finally end up in someone else's hands, you let it out to a select few people who you trust to exercise it (but not overly abuse it) and let you know what they think. You wait for feedback. You wait for praise. You wait for condemnation. You wait...
Finally, you can wait no longer. You must know! So you pick up the phone or send, out a few e-mails to find out. "Oh, yeah. It started fine, but then, it just stopped working after I did X. I was going to get back to you on that, but everything else has been kind of hectic." What!?!?! What happened? You want to know. "What went wrong?" you ask yourself. Laboriously you walk the users through what they did, what they saw, and how the system reacted. You look at the code: "It should work..."
You finally get some trace output from one of your users, and it helps to narrow down the problem. You're stuck, though, because the traces you do have are not enough to help you zero in completely on the problem. You have a vague notion of where it is, but you need more details. Blast! Why isn't there more logging?!?
Here's the flip side, though, that can be even nastier. If you've built your system from the ground up with logging embedded everywhere you can think of, you can get yourself into a performance nightmare. Sure, you know exactly what your system is doing, exactly when, and which parts of the code are executing. However, when you look at overall system performance, it's spending 80% of its time preparing, parsing, and writing log messages, and the rest of the system is choked for resources.
Ok, sure. Just turn down the level of logging (you did include log levels, right?), and you won't spend so much time and resources, right? Well, sure. But what happens when something goes wrong? If you've turned your logging levels down to the point where system performance is acceptable, have you turned them down so far that they still have any value?
Well, that's just the difference between development systems and production systems, right? In dev, you leave lots of logging on so that you can catch the problems early. In production, you've fixed all the problems, so logging is no longer as necessary, and you can get away with turning the log levels down to just the bare minimum. Right?
There's a problem here. Go back to the story above, and read it again. In development, everything was working fine. Once the software is handed to others (put in production), strange things happen. Users tend to use our software in ways we didn't take into account. They tend to do things we didn't think of. If we don't capture enough logging when the system is in production, what will we do when there is a problem?
If the default production logging settings are useless for debugging problems, then they are useless, period.
Here at Hericus Software we've introduced a concept in logging that lets you have your cake and eat it too. The Zed Server includes sophisticated logging mechanisms that allow the server to run at highly optimized speeds while still maintaining a useful log of state and ongoing activities. The key to the logging implementation is a buffered approach that allows the server to capture logs to memory first, and, if and when required, write these logs physically to disk.
That's only half of it, though. Buffering disk writes has been around for a long time. That's not interesting any more. What's interesting is our mechanism to look back through the buffer when a new message comes in and write out "history" prior to the message.
Each log channel has two flags. The first is the Enabled flag. This flag indicates whether messages on this channel are always written to disk. The second is the Capture To Buffer flag. This flag indicates whether messages on this channel are written to the in-memory buffer. When any log message is written to disk, a certain quantity of previous messages that have been stored in the buffer will be written first, and then the log message will be written to disk. This gives each message written to disk a "history" of other messages that provide context for what was happening at the time leading up to the log message. Once a log message has been written to disk, the in-memory buffer is thrown away, and messages are again accumulated.
This allows the system to keep most logging enabled all of the time, with very little actual performance overhead incurred. Yet, any time a severe message is received, there will be enough context of what the server was doing leading up to the event to provide good debugging and tracing information.
For example, debug messages are usually not written to the disk, but they are captured to the buffer because they can contain useful information. However, they are only useful when something goes wrong. Most of the time they stay in the memory buffer for a short period and then get thrown away as the buffer cycles through. Every now and then, though, a debug message will be followed by an error message. When this happens, the error message causes the 50 (or so) prior debug, warning, and info messages to be physically written to the disk first, followed by the error message itself.
Context! All of the sudden you have context. Your error message may be great, but what happened just before it? With the logging system in Zed, we can see the context leading up to an error message and gain a wealth of information about what the problem really is.
Without that context, sometimes even the best error messages still leave you in the dark.
Download Your Free 60 Day Trial Now.