Anybody who has written a computer program appreciates the warm glow of having a machine do your work. That pleasure often masks the burden of holding the program’s hand — kicking it off and checking that it ran correctly.
At least for me, the natural tendency is to stop short of complete automation. Instead, if I’m being lazy, I’ll start the program manually and catch errors by noticing odd behavior. Perhaps I’ll see a glaring failure message. Or maybe I’ll just notice that the program ran too quickly.
I do this because I’ve seen trusty programs fail in myriad ways. I’ve seen malformed or truncated inputs. I’ve seen choppy or slow network connections. I’ve seen disks fill up. I’ve seen memory hogs.
When an error like that pops up while you’re sitting right there at the command prompt, it’s typically easy to diagnose and move on. When the program runs unattended from a scheduler, you might not notice a problem until disaster strikes, sparking a late-night emergency call, or worse. Therefore, I have to fight my natural tendency to hand-hold my programs.
Through experience, I’ve learned to go the extra mile and properly automate my programs. These are some of my best practices — I hope other people can point out ones that I’ve overlooked.
- Every program must use a proper logger. The first thing I do when I write *any* program is to make sure it’s set up to log using that language’s preferred logging utility. This allows you to properly investigate what happened every time the program runs.
- Save the logs from each run. It’s possible that an error will go unnoticed for a while, and there’s nothing worse than saying “unfortunately the log file was deleted.”
- Validate the program’s inputs. I’ve been burned before by people who will “always save their Excel spreadsheet in the same way,” websites that “never change,” and data providers that “guarantee their data are validated before given to you.”
- Sanity check the outputs. If you always expect your program to generate some new data, then compare the new results to the old results, and verify that there are actually new results.
- Catch errors and retry. Computers are deterministic, but real world resources are unpredictable. I’ve seen one-off errors too often. Perhaps I call an external program that fails for an unknown reason. Or a file copy fails, but when I retry, it works fine.
- Have an extremely robust email notification system. When there’s an error, you should be notified. One trick is that a failure in the email system should never cause the program to fail. Perfect this library and reuse it.
- Try your program every time you change it. This is easy to say, but it’s tempting to say “oh, but I’m just adding one line” and not run your unit tests. If you change it, at the very least, give it a whirl.
- Run the program from multiple machines. Computers fail at all the wrong times. At the very least, design your program so that only one instance publishes a final result. Then if that computer fails, you can run a quick command on the other machine to publish whatever that program creates.
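
To make a few of these concrete, here is a minimal Python sketch of four of the practices above: logging to a per-run file, sanity checking outputs, retrying, and notifying without letting the notifier crash the program. All of the names (`setup_logger`, `retry`, `notify_safely`, `has_new_results`) are hypothetical helpers invented for illustration, and `send` stands in for whatever email library you have perfected. Treat it as a starting point under those assumptions, not a finished implementation.

```python
import logging
import time
from datetime import datetime
from pathlib import Path


def setup_logger(log_dir, name="nightly_job"):
    """Log to the console and to a timestamped file, so each run keeps its own log."""
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S_%f")
    log_path = Path(log_dir) / f"{name}_{stamp}.log"
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers if called twice
    fmt = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    for handler in (logging.FileHandler(log_path), logging.StreamHandler()):
        handler.setFormatter(fmt)
        logger.addHandler(handler)
    return logger, log_path


def has_new_results(old_rows, new_rows):
    """Sanity check: True if the new output contains rows the old output did not."""
    return len(set(new_rows) - set(old_rows)) > 0


def retry(func, attempts=3, delay_seconds=1.0, logger=None):
    """Call func(); on an exception, log it, wait, and try again up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception:
            if logger:
                logger.exception("attempt %d of %d failed", attempt, attempts)
            if attempt == attempts:
                raise  # out of retries: let the caller see the real error
            time.sleep(delay_seconds)


def notify_safely(send, message, logger=None):
    """Send a notification, but never let a notification failure crash the program."""
    try:
        send(message)
        return True
    except Exception:
        if logger:
            logger.exception("notification failed")
        return False


if __name__ == "__main__":
    import tempfile

    logger, log_path = setup_logger(tempfile.mkdtemp())
    calls = {"count": 0}

    def flaky_copy():
        # Simulates a file copy that fails once and then works.
        calls["count"] += 1
        if calls["count"] < 2:
            raise OSError("simulated one-off failure")
        return "copied"

    result = retry(flaky_copy, attempts=3, delay_seconds=0.1, logger=logger)
    logger.info("result: %s after %d calls", result, calls["count"])

    def broken_send(message):
        # Simulates the mail server being down.
        raise ConnectionError("simulated mail server outage")

    notify_safely(broken_send, "job finished", logger=logger)
```

The retry helper re-raises after the last attempt on purpose: a one-off failure gets absorbed, but a persistent one still surfaces so that it can be logged and reported. The notifier does the opposite and swallows its own errors, since a broken mail server should not take the job down with it.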


