This is the beginning of a small series of blog posts that explores the case study of the US Space Shuttle program and draws lessons for enterprise software projects.
Part 1: Background on the Space Shuttle program
By looking at some of NASA’s manned space flight history, we can see two of the easiest mistakes managers make when self-preservation of a favorite project becomes their modus operandi: becoming blind to reality, and forgetting how to listen to those closest to the action. Either mistake can kill a mission-critical system implementation.
Last year I watched Apollo 13 with my daughters. Apollo 13 is the movie based on Jim Lovell’s book Lost Moon, which tells the true story of the Apollo mission launched just nine months after the first landing on the moon. It is a dramatic story in which everything that could go wrong in the mission does go wrong. After liftoff, when the mission is halfway to the moon, the astronauts hear a violent explosion that no one can explain. Suddenly the ground crew and the astronauts face a situation they have never faced before – no one knows what exploded or which subsystems on the spacecraft have been damaged. All they know is that the oxygen and battery levels are falling.
The movie then shows the ground crew and the astronauts improvising solutions to one problem after another, all while the astronauts know they have only a slim chance of returning alive. While watching, both of my daughters said they would be scared to be astronauts. Later, after the astronauts safely land and the movie finishes, my youngest daughter Olivia seemed lost in thought. Eventually she said to me, “But Dad, it must be much safer now that we have all this experience, and we must know better how to do this. Don’t we?”
It turns out there was no good way to answer Olivia. Do we really know better how to “do this”, if by “doing this” we mean safely sending astronauts into space and returning them while completing the mission? The evidence is not at all clear.
Obviously, one of the objectives of the Space Shuttle program has been to avoid the deaths of astronauts. The Space Shuttle aimed to provide routine, safe access to space flight while significantly lowering the cost per launch. By “routine”, NASA planned to safely launch 12 to 24 shuttle missions per year. In fact, the goals of routine space flight and mission safety are the same goal – you cannot routinely launch manned space flights if you are losing spacecraft and astronauts. Loss of life in space flight always leads to long delays during the investigations.
In reality, the shuttle has launched 129 times in its 29-year history, averaging fewer than 5 launches per year, with a peak of 9 in 1985. The primary reason for these low launch numbers is that the shuttle carries far higher risks than managers were willing to acknowledge, leading to numerous program delays. Two missions have been lost – the Challenger disaster in 1986 and the Columbia disaster in 2003 – each resulting in the deaths of all seven astronauts aboard. The Challenger disaster in particular should have been a wake-up call that the shuttle was not as safe or reliable as managers thought. Did they learn?
Challenger Investigation – A Learning Opportunity (in hindsight)
During the Challenger investigation in 1986, Richard Feynman, a Nobel Prize-winning physicist on the investigation team, addressed the safety issue and noted a cultural inability among NASA managers to face reality and listen to the people closest to the action. As Feynman pointed out in 1986 in Appendix F of the official Challenger report [my highlights]:
It appears there are enormous differences of opinion as to the probability of a failure with loss of vehicle and of human life. The estimates range from roughly 1 in 100 to 1 in 100,000. The higher figures come from working engineers, and the very low figures come from management. What are the causes and consequences of this lack of agreement? Since 1 part in 100,000 would imply that one could launch a shuttle each day for 300 years expecting to lose only one, we could properly ask, “What is the cause of management’s fantastic faith in the machinery?”
Note that we are not talking about minor differences of opinion – we are talking about fundamentally different understandings of reality. Later in the same appendix Feynman gives us a glimpse of the answer, which is not very comforting:
NASA officials argue that the figure is much lower [than engineering estimates of 1 in 100]. They point out that “since the shuttle is a manned vehicle, the probability of mission success is necessarily very close to 1.0.” It is not very clear what this phrase means. Does it mean it is close to 1 or that it ought to be close to 1? They go on to explain, “Historically, this extremely high degree of mission success has given rise to a difference in philosophy between manned space flight programs and unmanned programs; i.e., numerical probability usage versus engineering judgment.”
My readers in 2010 have the benefit of hindsight to see that the engineers were very close in their estimates. The most recent NASA estimate is 1 in 100, which is close to the observed rate of roughly 1 in 65 (two losses in 129 flights) and precisely the estimate the engineers provided during the Challenger investigation. It has taken 24 years for NASA management to finally learn what their own engineers knew in 1986, even though they had a Nobel laureate presenting them the evidence in the official investigation.
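The arithmetic behind these figures is simple enough to check ourselves. The following sketch (using only the launch and loss counts quoted above) computes the observed loss rate and reproduces Feynman’s “launch each day for 300 years” illustration:

```python
# Sanity-check the failure-rate figures discussed in this post.

flights = 129  # shuttle launches through 2010, as cited above
losses = 2     # Challenger (1986) and Columbia (2003)

# Observed loss rate: about 1 in 65 flights.
print(f"Observed loss rate: 1 in {flights / losses:.0f}")

# Feynman's illustration: at odds of 1 in 100,000, launching daily
# for 300 years would be expected to lose roughly one vehicle.
launches_in_300_years = 300 * 365
expected_losses = launches_in_300_years / 100_000
print(f"Daily launches over 300 years: {launches_in_300_years}")
print(f"Expected losses at 1-in-100,000 odds: {expected_losses:.2f}")
```

The gap between the observed 1-in-65 rate and management’s 1-in-100,000 claim is a factor of more than a thousand – which is the point Feynman was driving at.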
Does this situation remind you of any projects at your institution? In the next few blog posts, I’ll draw lessons from the Space Shuttle program as they apply to enterprise software projects.