Logging, Profiling, Debugging, and Reporting Progress

I have a hypothesis. The hypothesis is that, in software, the goals of logging things for audit purposes, reporting progress to a user on the course of a long-winded process, profiling a system to see where it spends its time in the long run, and debugging certain kinds of problems (hangs of multithreaded systems particularly come to mind, though many other kinds qualify too) can all fundamentally be met by a single piece of software infrastructure.

I say this because, largely, I write very similar code to do all four of them, and end up building unnervingly similar-looking ad-hoc solutions, despite the fact that a single well-developed tool could do all of them really well.

Basically, all of them boil down to a bit of code stating what it's about to do and what it's just done at interesting points.

Let's look at each in turn.

Software often writes lines of text describing the current state of affairs, either to a log file (for unattended software such as daemons) or directly to the user, if there is one. Generally, it logs when steps in the process start (to document what it's about to do), when steps end (to document the results), and sometimes during iterations over something or other (although that's really just a special case of the previous two, as each iteration is itself a step). This is often done for auditing reasons - to find out what operations led up to some state of the system, for a plethora of reasons - and sometimes to offer a kind of primitive progress reporting: you can look at the log to see how far something has gotten.

But we have more sophisticated progress reporting, sometimes. Rather than using a dumb logging facility to say "I'm about to process 1000 objects", then "I have processed N objects so far" periodically and letting the user read the text and do the maths to work out how far through we are, we can use a more evolved progress reporting infrastructure (usually called a "progress bar"), telling it when the process starts (displaying the bar) and ends (removing the bar from the display) and periodically updating it on progress by telling it how many steps we have to do and how many are currently done. Given that higher-level information, it can produce a more intuitive display, and even estimate the time to completion and report a rate of progress.

Although the display is rather different, there is a great commonality in the interface - we are stating the fact that a process has started (and how many steps it will take), and periodically stating how far we have done, and then confirming when we are finished. A high-level progress reporting interface could just as easily generate lines in a log file as display a progress bar on a graphical screen, from the same software interface.

I've also used logging for debugging, quite extensively. Rather than logging user-level messages that relate to the user's model of the state of the system, I can also log messages about the internal state of things "under the covers", and use that to guide my own investigations of problems caused by the internal state going awry. Sometimes I will do that by just printing those messages to a console I can observe, and removing or commenting out the code that does so when I'm not debugging; but more featureful logging infrastructures allow for "debug logging" that is hidden away unless specially requested, meaning I can leave the logging statements present in "production code" and turn them on at run-time, which can be a great boon.

Meanwhile, in profiling and debugging server applications with multiple threads handling requests in parallel, I have often used a particular technique. I've given each thread a thread-local "current status string", and then peppered my code at interesting points with a call that sets the current thread's status string to a summary of what it's about to do. I can then easily ask the system to display a dump of all running threads and what they're currently doing. Java makes this easy, as threads have a name string and the tree of threads and thread groups can be introspected; in a system written in C using different processes rather than threads, I've written my own infrastructure to do it, using a shared memory segment with each process getting a fixed-sized slot therein, and some lock-free atomic update mechanisms to avoid races between the processes and the sampling tool.
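
To make that concrete, here's a minimal sketch of the technique in the Scheme-ish notation I'll use for the API proposal below; SRFI-18 threads and SRFI-69 hash tables are assumed, and a real implementation would need the locking or lock-free update machinery just described to keep the sampler and the workers from racing:

    ;; A per-thread status board: thread -> status string.
    (define thread-statuses (make-hash-table))

    ;; Called at interesting points, eg (set-status! "flushing cache").
    (define (set-status! status)
      (hash-table-set! thread-statuses (current-thread) status))

    ;; The "what is everyone up to?" dump.
    (define (dump-statuses port)
      (hash-table-walk thread-statuses
        (lambda (thread status)
          (display (thread-name thread) port)
          (display ": " port)
          (display status port)
          (newline port))))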

This lets me do three things. Firstly, when the system is grindingly slow, I can quickly see what every thread is up to. Is everyone blocked in the same operation, all queuing for the same limited resource? Secondly, when something hangs, I can look at the state of the unresponsive threads to see what they're doing. Generally, this shows me what lock they're stalled on, and who has the lock (or, more tellingly, if nobody seems to have the lock). And thirdly, I can profile the system during heavy load by periodically sampling the status of each thread and then building a histogram of statuses, to see which statuses take up most of the time. However, I had to be careful with this - it worked well if the status string merely recorded what step was in progress, but not if the status string included details of the data being worked upon, because that made the strings differ even when the system was at the same step, so they didn't count towards the same histogram bucket. The solution was to mark such parameter data in some way (such as by always quoting it in square brackets) so that profiling tools can make themselves blind to it.
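
As a sketch of that blinding trick (continuing in the same Scheme-ish notation): strip anything in square brackets before counting, so that statuses differing only in their parameter data land in the same bucket:

    ;; Strip [bracketed] parameter data from a status string.
    (define (strip-parameters status)
      (list->string
       (let loop ((chars (string->list status)) (depth 0))
         (cond ((null? chars) '())
               ((char=? (car chars) #\[) (loop (cdr chars) (+ depth 1)))
               ((char=? (car chars) #\]) (loop (cdr chars) (max 0 (- depth 1))))
               ((> depth 0) (loop (cdr chars) depth))
               (else (cons (car chars) (loop (cdr chars) depth)))))))

    (define status-histogram (make-hash-table))

    ;; Called on each sample of a thread's status string.
    (define (record-sample! status)
      (hash-table-update!/default status-histogram
                                  (strip-parameters status)
                                  (lambda (n) (+ n 1))
                                  0))

That way "writing block [disk3:1234]" and "writing block [disk0:99]" both count towards the same "writing block" bucket.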

Sometimes I've run into trouble with the fact that the same procedure might be called in many different places; the thread status shows me that I'm in that procedure (and where I am within it) but doesn't tell me why. If I attached a debugger I could view the entire stack trace, but that pauses execution (which may interfere with the delicate race condition I'm hunting down), and it's fiddly to do that for every thread or process in a system to get a good snapshot of the overall state - which is why I prefer the explicit-status approach in the first place. The solution is simple: rather than replacing the entire status string, each procedure should just append its status to the existing string, and remove it when it's done. In Java I did this by capturing the current thread name and then assigning "originalName + processStep" each time, then restoring "originalName" at the end of the process; in C, I did it by recording a pointer to the end of the current string, writing the new string there, and then writing a "\0" back at that point at the end (being careful not to overrun the buffer!). This turns the status string into a kind of mini-backtrace, but rather than recording each and every procedure call, it only records things considered important enough to log. When looking at a snapshot of a system with a hundred concurrent requests in progress, this is a great time-saver.
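
In the notation of the earlier sketch, that append-and-restore discipline might look like this; dynamic-wind takes care of the restore even on a non-local exit, which the Java and C versions had to handle by hand:

    ;; Look up this thread's current status, defaulting to "".
    (define (get-status)
      (hash-table-ref/default thread-statuses (current-thread) ""))

    ;; Append a step for the dynamic extent of thunk, then restore.
    (define (with-status step thunk)
      (let ((original (get-status)))
        (dynamic-wind
          (lambda () (set-status! (string-append original " > " step)))
          thunk
          (lambda () (set-status! original)))))

Nested calls to with-status then grow the string into a mini-backtrace like "handling request > parsing headers > decoding [utf-8]".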

But, clearly, all of the above really just comes down to "reporting the current status of a thread of execution". All that changes is what the infrastructure does with that information. There's no reason why the same notification from the software cannot generate a line in a log file, an on-screen log message to a waiting user, the creation, updating, or removal of a progress bar, the updating of a global scoreboard on what all the threads in the system are up to, and the updating of a profiling histogram of where the system is spending its resources.

So here's my proposal for an API.

(task-start! name [args...]) => task handle
Starts a task, which is some unit of work. Tasks may be dynamically nested within a thread, opening tasks within other tasks. Returns a task handle. The name is a string describing the task, which should itself be static, but may refer to the additional arguments positionally with %<number> syntax.
(task-status! handle status [args...]) => void
Notifies the system that the given task's current status is the given status string, which again may refer to the arguments positionally with %<number> syntax.
(counted-task-start! steps unit-s unit-p name [args...]) => task handle
Starts a task with the given number of counted steps. The singular name of a step is unit-s and the plural name is unit-p, eg "byte" and "bytes". Otherwise, as per task-start!
(counted-task-status! handle steps-completed [status [args...]]) => void
Notifies the system that the given counted task's current status is the given status string, which again may refer to the arguments positionally with %<number> syntax, and that the given number of steps have been completed so far. The status and arguments can be omitted if there's really nothing more to say than the number of steps completed.
(task-end! handle) => void
Ends a task (counted or not), given its handle.
(task-fault! handle message [args...]) => void
Logs a fatal fault (internal failure within the task) of a task. The message should be a static string, referring positionally to the arguments with the usual syntax.
(task-error! handle message [args...]) => void
Logs a fatal error (invalid inputs, or invalid behaviour from a subcomponent, but not a problem with the task itself) encountered by a task. The message is as above.
(task-warning! handle message [args...]) => void
Logs a non-fatal problem encountered by a task. The message is as above.
(task-debug! handle message [args...]) => void
Logs a debug-level event encountered by a task. The message is as above.
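
To show the shape of the thing, here's a hypothetical usage sketch; copy-file and the argument values are stand-ins, not part of the proposal:

    ;; Copy a list of files, reporting progress through the proposed API.
    (define (backup-files! files destination)
      (let ((task (counted-task-start! (length files) "file" "files"
                                       "Backing up to [%1]" destination)))
        (let loop ((remaining files) (done 0))
          (unless (null? remaining)
            ;; Parameter data goes in [square brackets], so profilers
            ;; can blind themselves to it, as discussed above.
            (counted-task-status! task done "Copying [%1]" (car remaining))
            (copy-file (car remaining) destination) ; assumed helper
            (loop (cdr remaining) (+ done 1))))
        (task-end! task)))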

Now, this looks like a very ordinary logging framework, with the exception of an explicit hierarchy of nested tasks and the explicit mention of "counted" tasks with a known number of steps. Yet those two additions allow the same interface to cover all of the above goals.

Seeing how they might generate a log file for auditing is trivial. The explicit knowledge of task nesting lets us give context to common subtasks used in lots of different places, whether by something as simple as indenting each log message according to the subtask nesting depth, or by creating an aggregate "task pathname" that combines the names of all parent tasks into one long string to log.
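
A minimal sketch of the "task pathname" idea, assuming task-name and task-parent accessors on task handles (with task-parent returning #f at the root):

    ;; Build a pathname like "nightly backup/copying files/compressing"
    ;; from the chain of parent tasks.
    (define (task-pathname task)
      (let ((name (task-name task)))
        (if (task-parent task)
            (string-append (task-pathname (task-parent task)) "/" name)
            name)))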

Generating a profile is also trivial; using the task names and status strings without interpolating the arguments, we can obtain a measure of what bit of the code we're in - either in isolation, or including the entire context of parent tasks, as desired. And we can also generate histograms of the arguments for each different status if we want; if a given subtask takes widely varying amounts of time depending on how it's called, we can find out what arguments make it run slowly to narrow down the problem.
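
Concretely, a sampling profiler might tick along these lines, where all-threads and current-task-of are assumed hooks into the infrastructure:

    ;; On each timer tick, count one sample against each thread's current
    ;; task pathname; the name strings are static (arguments are carried
    ;; separately), so identical steps share a bucket automatically.
    (define (profile-tick! histogram)
      (for-each
       (lambda (thread)
         (let ((task (current-task-of thread)))
           (when task
             (hash-table-update!/default histogram (task-pathname task)
                                         (lambda (n) (+ n 1)) 0))))
       (all-threads)))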

Debugging is helped by turning on the display of debug-level messages in the log, and by making a snapshot of the current status (status and name of the current task and all parent tasks) of each thread/process in the system available for introspecting the current state of the system. That's a useful thing to include in a crash log, too.
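
The same assumed hooks give a crash-log-friendly snapshot, something like:

    ;; Dump every thread's task pathname and latest status, one per line.
    (define (dump-snapshot port)
      (for-each
       (lambda (thread)
         (let ((task (current-task-of thread)))
           (when task
             (display (task-pathname task) port)
             (display " : " port)
             (display (task-status-string task) port) ; assumed accessor
             (newline port))))
       (all-threads)))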

But reporting progress to the user is where having a common infrastructure really shines. Rather than needing the explicit construction of "progress dialogs" in interactive applications, the infrastructure could notice when a thread has spent more than a few seconds inside a counted task and produce one automatically. It would display the progress of the highest-level parent counted task of the current task hierarchy, as the overall progress measure of the operation in progress; but if a subtask takes more than a few seconds, then it becomes worthwhile automatically expanding the dialog to list the entire hierarchy of nested counted tasks with progress bars (the very tip of the hierarchy, which has not yet gained several seconds of age, should be elided to avoid spamming the user with endlessly whizzing progress bars for short-lived subtasks; only display subtasks that have shown themselves to take user-interesting amounts of time). And the display of non-counted subtasks and their textual statuses within the hierarchy can be turned on as an option for users who want to see "more verbose" information, perhaps along with each task then growing a scrollable log showing that task's history of statuses, warnings, started subtasks, and so on.
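
The "only show tasks that have proven slow" rule might be expressed as simply as this, with the threshold being an arbitrary tunable:

    ;; A counted task earns a progress bar once it has been running
    ;; longer than the threshold; younger subtasks stay elided.
    (define progress-bar-threshold 2) ; seconds - an assumed tuning knob

    (define (task-displayable? task now)
      (and (counted-task? task)              ; assumed predicate
           (> (- now (task-start-time task)) ; assumed accessor
              progress-bar-threshold)))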

How best to handle multithreaded tasks is an open question, depending really on the threading model of the language you're using. Perhaps tasks should require an explicit parent-task handle to be passed in, to make it clear what the task hierarchy is in the presence of threads; or perhaps newly created threads should inherit their parent's current task. Either way, with threading, it's possible for a task to be parent to more than one active subtask, and a progress reporting user interface will need to handle that, perhaps by dividing the space underneath the parent task into multiple vertical columns for each concurrent subtask, when that level of detail is required.

Also left unspecified is more detail on the "log level" concept; I've just suggested a few (fault, error, warning, status update (called "notice" or "info" in most existing logging systems), and debug), but the system really needs to make some intelligent decision as to who to notify. A server-based interactive application hitting a fault condition really needs to apologise to the user in vague terms, while sending a detailed crash log and raising an alarm to the system's administrators, for instance. And more complex systems composed of multiple components may have fairly complex rules about who should be notified about what level of event in what context, which I've not even begun to worry about here...

But, to conclude, I think it's a shame that there are so many very different bits of infrastructure that have very similar interfaces and are used in similar ways by software, but with very different goals. I think it'd be great to create a single progress-reporting interface for applications to use, and to make the different goals into pluggable components beneath that interface!

2 Comments

  • By Ben, Fri 28th Dec 2012 @ 9:20 am

    Doesn't DTrace cover most, if not all, of this? It has a common API by specifying probes, which give you an awful lot of flexibility.

    And it's been used for

    1) Auditing activity in the system (eg Luke's ZFS replication triggers)

    2) Interactive display of information (eg the "top"-like scripts)

    3) System wide profiling

    4) Debugging of specific problems

    Although only the latter two are things it's primarily designed to achieve, it's proven itself remarkably flexible. So at the least it should be an interesting model for you to examine.

  • By alaric, Fri 28th Dec 2012 @ 1:07 pm

    DTrace is nice, but unless I've missed something, the DTRACE_PROBE interface doesn't have a way to specify the hierarchical structure of nested tasks, nor to say "this complex task is N% through" in a standard way - which are the two key things in my API required to provide the functionality beyond traditional systems!
