How to Build Reliable Systems

In his Behaviorally Speaking series, Bob Aiello discusses hands-on software configuration management best practices within the context of organizational and group behavior.

Summary:

Bob Aiello describes some of the essential techniques necessary to ensure that systems can be upgraded and supported while enabling the business through frequent and continuous delivery of new system features.

Anyone who follows technology news is keenly aware that there have been a remarkable number of high-profile system glitches, at times with catastrophic results. Major trading exchanges both in the US and in Tokyo have suffered serious outages that call into question the reliability of the world financial system itself. Knight Capital Group has essentially ceased to exist as a corporate entity after what was reported to be a configuration management error that resulted in a one-day $440 million loss. Incidents like that highlight the importance of effective configuration management best practices and place a strong focus on the need for reliable systems. But what exactly makes a system reliable, and how do we implement reliable systems? This article describes some of the essential techniques necessary to ensure that systems can be upgraded and supported while enabling the business through frequent and continuous delivery of new system features.

Mission-critical and enterprise-wide computer systems today are often very complex, with many moving parts and even more interfaces between components; this presents special challenges even for expert configuration management engineers. These systems keep growing more complex as the demand for features and rapid time to market creates problems that many technology professionals could not have envisioned even a few years ago.

Computer systems do more today, and many seem to learn more about us each and every day, evolving into intricate knowledge management systems that seem to anticipate our every need. High-frequency trading systems are just one example of multifaceted computer systems that must be supported by industry best practices; this is to ensure rapid and reliable system upgrades and implementation of market-driven new features.

These same systems can cause severe consequences when system glitches occur, especially as a result of a failed system upgrade. FINRA is a highly respected regulatory authority that has recently issued a targeted examination letter to ten firms that support high-frequency trading systems. The letter requests that the firms provide information about their “software development lifecycle for trading algorithms, as well as controls surrounding automated trading technology” [1]. Some organizations may find it challenging to demonstrate adequate IT controls; really, the goal should be to implement effective IT controls that help guarantee systems reliability.

Recently, I had the opportunity to teach configuration management best practices at the NITSL conference in Detroit for nuclear power plant engineers and quality assurance professionals. Everyone in the room was committed to software safety, including reliable safety systems.

In the IEEE, we are starting a working group to help update some of the related industry standards that help define software reliability, measures of dependability, and safety. Make sure that you contact me directly if you are interested in hearing more about participating in these worthwhile endeavors. Standards and frameworks are valuable, but it takes more than just guidelines to make reliable software. Most professionals focus on the importance of accurate requirements and well-written test scripts, which are essential but not sufficient to create reliable software. What really needs to happen is that we build in quality from the very beginning, an essential teaching that many of us learned from quality management guru W. Edwards Deming [2].

The key to success is to build the automated deployment pipeline from the very beginning of the application development lifecycle. We all know that software systems must be built with quality in mind from the beginning, and this includes the deployment framework itself. Using effective source code management practices along with automated application build, package, and deployment is only the beginning. You also need to understand that building a deployment factory is a major systems development effort in itself. It has been my experience that many CM professionals forget to construct automated build, package, and deployment systems with the same rigor that they would apply to a trading system. As the old adage says, “The chain is only as strong as its weakest link,” and inadequate deployment automation is indeed a very weak link.
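To make the idea of treating deployment automation as a system in its own right a little more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not a description of any particular tool: the names `StageResult`, `run_pipeline`, and `make_demo_stages` are hypothetical, and the "artifact" is just a byte string standing in for a real package. The sketch shows two habits that tend to matter: stopping the pipeline at the first failed stage, and verifying at deploy time that the artifact is the same one that was packaged.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class StageResult:
    """Outcome of one pipeline stage (hypothetical structure for this sketch)."""
    name: str
    ok: bool
    detail: str = ""


def sha256_of(data: bytes) -> str:
    """Return the hex SHA-256 digest used here to identify an artifact."""
    return hashlib.sha256(data).hexdigest()


def run_pipeline(stages: List[Callable[[], StageResult]]) -> List[StageResult]:
    """Run stages in order and stop at the first failure."""
    results: List[StageResult] = []
    for stage in stages:
        result = stage()
        results.append(result)
        if not result.ok:
            break  # later stages never run against a broken earlier stage
    return results


def make_demo_stages(artifact: bytes, tampered: bool = False):
    """Build three illustrative stages that share state through a dict.

    `tampered` simulates the artifact changing between package and deploy.
    """
    state: Dict[str, object] = {}

    def build() -> StageResult:
        state["artifact"] = artifact
        return StageResult("build", True)

    def package() -> StageResult:
        # Record the digest of exactly what was built.
        state["digest"] = sha256_of(state["artifact"])
        return StageResult("package", True, str(state["digest"]))

    def deploy() -> StageResult:
        received = state["artifact"] + (b"x" if tampered else b"")
        # Refuse to deploy anything that does not match the packaged digest.
        if sha256_of(received) != state["digest"]:
            return StageResult("deploy", False, "digest mismatch")
        return StageResult("deploy", True)

    return [build, package, deploy]
```

Calling `run_pipeline(make_demo_stages(b"app-1.0"))` runs all three stages, while passing `tampered=True` makes the deploy stage fail with a digest mismatch. A real deployment factory would add logging, rollback, and environment-specific configuration, and it deserves the same version control and testing as the application it ships.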

Successful organizations understand that quality has to be a cultural norm. This means that development teams must take seriously everything from requirements management to version control of test scripts and release notes. Organizations need to take the time to train and support developers in the use of robust version control solutions and automated application build tools, such as Ant, Maven, Make, and MSBuild. The tools and plumbing to build, package, and deploy the application must be first-class citizens and a fundamental component of the application development effort.

Agile development and DevOps are providing some key concepts and methodologies for achieving success, but the truth is that every organization has its own unique requirements, challenges, and critical success factors. If you want to be successful, then you need to approach this effort with the knowledge and perspective that critical systems are cumbersome to develop and also cumbersome to support. Building the automated deployment framework should not be an afterthought or an optional task started late in the process. Building quality into the development of intricate computer systems requires what Deming described in the first of his fourteen points: “Create constancy of purpose for continual improvement of products and service to society” [2].

We all know that nuclear power plants, medical life-support systems, and missile-defense systems must be reliable, and they obviously must be upgraded from time to time, often due to uncontrollable market demands. Efforts by responsible regulatory agencies, such as FINRA, are essential for helping financial services firms realize the importance of creating reliable systems. DevOps and configuration management best practices are fundamental to the successful creation of reliable software systems. You need to start this journey from the very beginning of the software and systems delivery effort. Make sure that you drop me a line and let me know what you are doing to develop reliable software systems!

[1] http://www.finra.org/Industry/Regulation/Guidance/TargetedExaminationLetters/P298161

[2] Deming, W. Edwards (1986). Out of the Crisis. MIT Press.

AgileConnection is a TechWell community.