Beyond Earth (ATWG) - Chapter 30 - Managing Risks On The Space Frontier, The Paradox of Safety, Reliability and Risk Taking by F. Hsu and R. Duffey


Chapter 30

Managing Risks On The Space Frontier, The Paradox of Safety, Reliability and Risk Taking

by Feng Hsu, Ph.D. NASA, USA and Romney Duffey, Ph.D. AECL, Canada

Introduction

Is human space flight only for highly trained, well-educated astronauts? Or is it just for a few of the wealthiest space tourists among us? Can commercial space travel be made safe yet affordable for us all? Can the human space venture succeed without the sacrifice of lives? Attempting to answer these questions is by no means a trivial endeavor. It must certainly lead to rigorous debate, inevitably probing the paradox of risk taking in the face of safety and uncertainty in humanity's continuing drive to open the Space Frontier.

With the profound resurgence of human ingenuity at the dawn of the new millennium, technological frontiers have been created and pushed forward at a stunning pace never before seen in the recorded history of human civilization. In the wake of the information age, where Internet-propelled digital packets permeate every fabric of human life, the nanotechnology and bioengineering revolutions are transforming human society permanently and irreversibly. All in all, recent advances on the space frontier have been no less than astonishing, and are quietly incubating yet another technological revolution. This will change not only the fate of the human species on Earth, but may also transform us from single-planet inhabitants into multi-planet creatures able to survive indefinitely in the vast universe! Following President Bush's bold vision of human space exploration announced in January 2004, only a few months later came the headline-making flight of SpaceShipOne, launched out of the Mojave Desert of California: the first privately financed commercial vehicle to take ordinary citizens to suborbital space and land safely back on Earth. Subsequently, SpaceShipOne's financial backers have promised to develop and promote routine space tourism in the near future. Their flight joins several other flights over the preceding four years that carried private travelers to space commercially. A recent partnership, The Spaceship Company, powered by expertise from Burt Rutan and his team at Mojave, California-based Scaled Composites and by more than $25 million from software billionaire Paul Allen, has been formed between Rutan's team and British tycoon Richard Branson's Virgin Galactic. They plan to build a fleet of five seven-passenger "SpaceShipTwo" spacecraft using SpaceShipOne technology. Most recently, they announced plans to build a commercial spaceport in New Mexico, and are aiming to begin space tourism service as early as late 2008.

Infused with the commercial significance of the SpaceShipOne test flight, a worldwide space movement has been propelled at full steam by entrepreneurs, politicians, space enthusiasts and business owners alike. One year after the history-making SpaceShipOne flight and its capture of the 10-million-dollar X Prize, Dr. Peter Diamandis, the X Prize founder, is joining forces with a venture capitalist to establish a NASCAR-like rocket-racing league (RRL) for rocket-powered aircraft. XCOR founder Jeff Greason, after successfully developing a methane-fueled rocket engine, has recently set up yet another prize for steam engine development. The 41-year-old founder of Amazon.com, Jeff Bezos, a native of Texas, announced several months ago his intent to use his newly acquired 165,000 acres of desolate ranch land in west Texas to build the world's largest spaceport, from which to launch space vehicles carrying ordinary space travelers. Among other space enthusiasts, entrepreneurs and privately owned aerospace pioneers is Las Vegas-based Bigelow Aerospace, which has unveiled its latest effort to build inflatable space habitats and space stations to help inspire and drive humanity's permanent entry into, and colonization of, the space frontier. A recent report has revealed that a company called Space Adventures, based in Arlington, Virginia, is already planning to offer commercial trips around the Moon within three to five years, for $100 million per passenger. Despite the launch of the first private rocket into space, privately piloted human space flight has only reached the tip of the iceberg. Nevertheless, it will not be long before civilian suborbital flights become a reality, operating much the way commercial jetliners do today; not only will new jobs be created, but a whole new industry will rise on the horizon. Despite SpaceShipOne's success, despite the glorious human achievements of the Apollo days, the more than twenty years of Space Shuttle flight experience and the recent marvelous missions of the Mars exploration landers, human space explorers, NASA and space travel entrepreneurs still have very tough challenges ahead. These challenges are not only technological ones. Do we really know how to make spaceships that can fly a couple of times a day, every day, for years? How can we fly orbital space vehicles so safely and so reliably that we can fly people as a business, much the way commercial airlines do? Do we really know the economics of the emerging space flight industry, and how to make it profitable enough to sustain space travel as a commercial enterprise…? Key questions emerge: if we are ever going to free the human race from its confinement to Earth, how much societal, political and individual safety risk, and how much sacrifice of resources, are we willing to accept to open up the new frontiers in space? Are the risks of manned space flight worth taking in the course of human exploration into the heavens? Are such risks expanding or contracting as we continue to push open the door to the boundless Universe? What balance ought we to strike between the benefits to human life on Earth and the societal, political, financial and human costs of continued exploration of the outer planets? How safe is safe enough for us to step ahead on the continued course of space exploration each time tragedy and accident strike us?
When does the drive for greater efficiency approach reckless risk taking in the face of safety and resource constraints? This chapter provides some interesting insights on these questions, from perspectives ranging over technological, programmatic and strategic risk taking and socio-political issues.

Managing the Human Elements of Risk Taking

The history of fatal accidents and tragedies in human space activities reminds us that this is an extremely risky endeavor, one that demands not only costly social, political and financial resources, but oftentimes the sacrifice of human lives as well. The most horrific examples of risks to the US manned space programs are the fatal accidents of Apollo 1 on January 27, 1967 and the two Space Shuttle disasters of Challenger and Columbia on January 28, 1986 and February 1, 2003 respectively. The human reactions and societal policy responses to these tragic events have repeatedly consisted of exhaustive engineering reviews, detailed hazard assessments, and expensive hardware tests and redesigns by system experts. These reviews have been exemplified by the search for cause and by the assignment of blame. Most notable are the lengthy, high-profile accident investigations by Congress and Presidential Commissions, which often lead to the replacement of key program managers or the reorganization of institutional hierarchies and infrastructures, regardless of the actual effectiveness of these measures. Nonetheless, sooner or later more fatal accidents strike us unexpectedly, even after each major accident has been followed by extensive, painstakingly detailed and careful investigation, exhaustive safety assessment and re-engineering, and changes in requirements, organizational structures and communications. Accidents in many other industries, whether the Chernobyl reactor explosion of 1986, the Three Mile Island reactor meltdown of 1979, the Piper Alpha offshore oil platform fire in the North Sea in 1988, or the Concorde crash in Paris in 2000, all exhibit a familiar and interesting pattern. The primary human responses to these tragic events have been either to raise serious political doubts about the risks behind the economic and technological benefits of the industry in general, or to carry out extraordinarily lengthy investigations focused primarily on the technological specifics of the accident phenomena that caused the tragedy.

Why do accidents and adverse outcomes happen all the time, in spite of most of them being preventable? As humans we tend to fix the known, what we observe to be wrong or to be the cause, and largely ignore the unknown, since it has not yet occurred. Therefore, fixing not only the observed causes but, more importantly, exploring the unknown in a systematic and integrated way is the key to preventing disasters and reducing the likelihood of fatal events recurring. Even while we fear the unknown, we should still face it. Exploring the unknown carries the important implication that human exploration of the Space Frontier is inherently a high-risk endeavor, yet only a few of us fully recognize this nature of our space activities. Thus, the general public's perception of the true risk, and its support and political will to continue moving forward on the space frontier, have become the biggest casualty of every past accident. Can we imagine that, without strong societal and political support, we could have succeeded in landing humans on the Moon some two and a half years after the Apollo 1 disaster struck us, or some seven months after the near miss of Apollo 8, the first manned mission ever to orbit the Moon? Can we also imagine that, without strong and continued political will from their respective dynasties or regimes, the ancient Chinese could have built the Great Wall, stretching over 6,700 kilometers of mountainous terrain, over a span of hundreds of years beginning more than two thousand years ago? Or that the Americans could have built the Panama Canal connecting two great oceans? Or could we have built the marvelous Hoover Dam at all?

On the contrary, despite going through all sorts of lengthy safety "stand-downs" within NASA organizations, spending nearly three years on RTF (return to flight) activities, and hundreds of millions of dollars to ensure Space Shuttle safety after the Columbia tragedy, we barely missed yet another major disaster on the subsequent flight of Discovery (STS-114). Have any of us thought about why we were able to land on the Moon some 35 years ago, less than nine years from President Kennedy's famous call to action of May 25, 1961 to the successful Apollo 11 mission, in which a "giant leap for mankind" was set forth on the surface of the Moon? Why, 35 years later, with two years already gone by since President Bush's passionate announcement on January 14, 2004, are we still not quite sure whether we can accomplish the first phase of the new space initiative (going to the Moon) by 2018? Can we really say that, beside the "space race" element of the 1960s (which can also be attributed to a different type of strong political will and public support, motivated by the Cold War rather than by peaceful space exploration), the Challenger and Columbia syndrome is not taking a toll deep inside our human spirit and courage, resulting in an overly conservative Constellation program plan for exploring the unknown? Clearly, perhaps the greatest damage from space disasters has been "fear", or "fear of moving forward", which has seeped deep into our minds. It is the loss of continued societal and political support that ultimately brings about our inability to make reasonable decisions on taking risks, which of course includes exploring the unknowns in space. As the perception of risk is governed by unknown outcomes, we must not follow such a rabbit trail, repeatedly fearing failure, if we truly possess that most invaluable and superb human spirit to explore the universe. The real challenges of managing safety risks are indeed not necessarily technological ones, but lie inside the human spirit and the willingness to continue exploring the unknown. In the history of civilization, human beings have always proven able to overcome inconceivably complex technological challenges as long as there was strong societal will and continued political support. Perhaps the most severe consequence to human society of any disaster in a socio-technological system is the loss of the human spirit for adventure, for exploring the unknown, the frontiers of nature and of the Universe.

Managing the Human Perceptions of Risk

While recognizing the risky nature of the human space endeavor and carefully managing the risks to safety and system reliability, we must also rectify some biased risk perceptions of human space flight that have become deeply rooted in the minds of the general public and, more importantly, in the minds of politicians. The risk profiles and public perceptions of space-related activities have always been extraordinarily high compared with other human activities involving complex systems. Take the commercial airline industry, for example: airplane crashes with hundreds of human casualties occur rather regularly, at least a couple of times every few years. However, the general public seems to have developed a psychological barrier against this fear, and keeps on traveling by air as if nothing had happened. The risk perception is that the risk is "acceptable": compared to other risks in life it is small. Thus, there has never been any notable open criticism in the public media, nor any political call for the dismantling or abandonment of the civil aviation industry, even after major airline crashes that resulted in sizable human casualties. The only commercial airline crash in history that brought an end to an invaluable and complex technology was the total dismantling of the Concorde supersonic fleet by Air France and British Airways. This became such a regrettable tragedy because it was much more a commercial failure, a failure of miscalculating the risk of a strategic program decision, than a failure of the Concorde technology itself. Can we imagine that all of today's commercial air travel would be possible if politicians had not given it strong backing (oftentimes risking their political careers by acting against biased public risk perceptions or political views) back in the early 1930s, in the infancy of the modern civil aviation industry? Similarly, the safety risk of commercial space travel and space tourism, just like the risks of any other complex technological marvel humans have created, takes painstaking processes and time to understand.

Figure 30.1. The Steady Increase of Survival Rate in Airline Safety

More importantly, it takes human perseverance to allow such risks to be gradually reduced to the point where both the technology and the human interfaces with that technology become highly mature. As an example, Figure 30.1 shows tangible statistics on how survival rates, and flight safety generally, in the commercial airlines have steadily improved over the years since the early days of the civil aviation industry. The survivability aspect is key because we cannot reduce the rate of these (rare) events any further. Just as with the automobile, we adopt mitigating measures in case of a crash.

Accepting Risk in Emerging Technologies

Every new technology is greeted with some initial fear and trepidation. It is a natural human reaction to the unknown, and an uncomfortable feeling. Examples are legion: automobile transport replacing horses; electricity replacing candles and gas lamps; genetically engineered crops replacing selected strains; biotechnology replacing traditional medicines; vaccines replacing bloodletting; nuclear energy replacing coal. Obviously, if we fall short of earning public support or stop short of moving forward on the human space activity front, there can be no future for the emerging space travel and tourism industry. By accepting some initial R&D risk in human suborbital flight today, we can truly envisage a day in the near future when one could take off from a spaceport somewhere in the west Texas desert shortly after breakfast, land safely in China's east-coast city of Shanghai for a trade meeting, and then take another (space) flight back to Houston for dinner on the same day!


Table 30.1. Is Driving Safer than Flying?


Table 30.2. Is Flying Really More Risky?

The news media would rarely make more than a day's headlines or front-page reports out of a seemingly routine aviation accident. By contrast, the public and societal responses to a Space Shuttle crash, or to any nuclear plant incident, have been profoundly oversensitive and sensational compared with those to a commercial plane crash that may have killed many times more people. It is human nature that the perception of risk of the unknown far exceeds the reality. The same biased risk perception by the general public exists in the case of automobile accidents: no one seems to fear driving, even though the automobile risk (the likelihood of being killed on a single trip) is about 1 in 7.6 million, nearly an order of magnitude (roughly seven times) greater than the commercial airline risk of a 1 in 52.6 million chance of being killed on a single trip. Tables 30.1 and 30.2 and Figures 30.2, 30.3 and

30.4 provide a more detailed risk comparison between several differing industries and human activities. Rare events and learning curves show that airlines with more flights have lower rates, simply because of the greater number of flights accumulated before having (or not having) an accident, which is the equivalent of being lucky or not. This rare-event line is shown in Figure 30.2 as a line of slope minus one through the data; it is an equal-risk line.
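The two quantitative points above, the per-trip risk ratio and the slope-minus-one equal-risk line, can be checked with a few lines of arithmetic. The sketch below uses only the per-trip figures quoted in the text plus assumed illustrative flight counts; it is not drawn from the chapter's data tables.

 # Illustrative sketch: per-trip risk ratio, and why a fixed number of rare events
 # traces a slope of minus one on a log-log plot of rate versus experience.

 # Per-trip fatality risks quoted in the text
 p_car = 1 / 7.6e6     # probability of being killed on a single car trip
 p_air = 1 / 52.6e6    # probability of being killed on a single airline trip
 print(f"car/airline risk ratio per trip: {p_car / p_air:.1f}")  # ~6.9, nearly an order of magnitude

 # Equal-risk ("rare event") line: with a fixed number of accidents n,
 # the observed rate is n / flights, so log(rate) = log(n) - log(flights),
 # i.e. a straight line of slope minus one on log-log axes.
 n_accidents = 1
 for flights in (1e4, 1e5, 1e6):   # assumed illustrative experience levels
     rate = n_accidents / flights
     print(f"{flights:10.0f} flights -> observed rate {rate:.1e} per flight")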

Figure 30.2. Fatal accident rate of commercial aviation vs. the Space Shuttle

The paradox with a new technology like space travel is simply this: it is impossible to have a lower accident rate until many flights have occurred; meanwhile, unless we have many more flights, we cannot learn how to reduce the number. Basically, we must have events in order to learn from them. The reason the risk perception of space flight in the minds of politicians and the general public is so biased is primarily the inherently high media and political profile of human space activities. Society generally does not understand the difference between a mature technology, as in the case of commercial airlines, and a research and development (R&D) technology, as in the case of human space exploration. People tend to perceive the risk of a Space Shuttle flight from the perspective of a commercial airline flight, and therefore wishfully expect human space activities to be as safe as civil aviation. This is clearly unattainable at an early stage of any technology, like space travel.


Figure 30.3. Typical percentage distribution of airline fatalities by phase of flight

Such unscientifically founded societal and political expectations must be rectified before major endeavors can be undertaken successfully on the human space frontier. On the other hand, we must understand that the psychology of the general public's risk perception of various human activities is easily influenced by the visibility of the underlying events, and by the degree to which the mass media portray them. Again, in the case of the Space Shuttle accidents, yes, a total of 14 brave astronauts were sacrificed in the entire SSP (Space Shuttle Program) history spanning nearly a quarter of a century. Yet society and space policy makers seemed eager to dismantle the entire program over safety concerns, while blind to its overwhelming benefits in engineering, scientific, economic and international socio-political achievements. Indeed, if the total abandonment of the existing Space Shuttle program rests merely on a safety risk concern in the minds of our space policy-makers, then it is clearly worth taking a second look at the risk problem before it is too late.


Figure 30.4. Frequency of Fatalities due to man-caused events


Dealing with Moral Issues in Risk Taking

There is also a moral issue at stake: how do we really measure the loss of individual lives as a risk factor or an ultimate cost to our society? In particular, how do we value the lives of 14 astronauts against the lives of the tens of thousands of passengers killed each year in commercial plane crashes or automobile accidents, or of those who give their lives on duty for the public good, such as test pilots, medical doctors, soldiers, police officers or firefighters? Arguably, among the many killed are professors, scientists, well-trained technologists and important public officials whose deaths are as great a loss to our society as that of the astronauts sacrificed.

In pioneering any technology, risks must be taken, whether they are financial, technological, operational, legal, societal, or human. These questions are profound in their ethical, social, moral and technological implications. One of the growing obstacles facing human society, and especially Americans, in exploring the frontiers of science and technology is skyrocketing legal risk, which has undoubtedly become a key factor in ever-increasing societal costs and budgetary pressures. The human spirit of taking risks in the unknown territory of yesterday is certainly worth remembering today. The inventor of the smallpox vaccine tested it on himself first but did not die; one of the discoverers of X-rays overexposed themselves and did. Likewise, is the sacrifice of astronauts in an accident a worthy one for the noble cause of mankind's ferrying into space, of which many of us would have dreamed of being a part? Should the Wright brothers have given up their extremely risky test flights in the face of death over 100 years ago? If American society had blindly criticized and prevented Charles Lindbergh's brave transatlantic adventure of May 20, 1927, we would have no commercial airline industry today, and human life on Earth would never have been the same. Instead, he was lionized for his successful risk taking. Losing fourteen people who were very important to their families and friends, in addition to being important to the whole human race, in such an unexpected way is horrible and tragic. Such losses, though, are not only well worthy of the cause in which they were engaged, but are part of life as well. As bad as the deaths of these fourteen astronauts are, sudden unexpected deaths happen much more often, all the time. They just do not get the notoriety that Shuttle accidents do, and we accept that risk and loss.

The conclusion of the Columbia Accident Investigation Board in the CAIB report finally offers us a much broader vision on this issue: "Operation of the Space Shuttle, and all human spaceflight, is a developmental activity with high inherent risks." These words are worth bearing in mind, as the future spacecraft developed to ferry humans to the Moon and Mars will inevitably be new types of spacecraft that must withstand even harsher flight conditions than Apollo or the Shuttle did. So, while rigorously pursuing scientifically based methodologies to ensure high standards of safety, we must not only rectify biased risk perceptions about human space endeavors, but also take a more balanced psychological and moral stand, weighing human sacrifice against all the great benefits exploration brings to the advancement of mankind's search on the space frontier. We can hardly achieve anything if we neglect to educate the public and the policy makers. If these issues are not fully recognized by society as a whole, we will certainly achieve much less, wasting more resources and even more human lives on the road ahead. We must also fully understand the paradox of the risks and social costs of not taking risks in human space activities, or in any human development activity. Throughout human history, the nations that have led the world in exploration have been the most powerful on the planet, and they are looked up to for their leadership. This is exactly why we firmly believe that, as the world's leading nations, we need a clear vision for collaborative R&D effort in pushing exploration of the space frontier, leading the way in space commercialization, physics, biology, and all the other areas that help transform and keep the peace-loving nations part of a viable, combined economic and technological force on Earth, one that will ultimately move mankind outward into the solar system and beyond.

Managing the Technological Risks

In just the hundred years since the Wright brothers' courageous test flight, human beings have solved the problem of how to fly and travel safely around our planet. The history of human industrialization vividly suggests that advances in science, technology and engineering on the space frontier will drive the next major breakout of human industrialization, from Earth into space. With the suborbital flight of SpaceShipOne, history has once again opened a major frontier in space science and technology development. The so-called "Space Highway" concept and its subsequent achievement have brought us yet another historic opportunity in designing, producing, operating and managing future generations of space vehicles. Meanwhile, how those technologies are developed, utilized and managed over the next few decades is of paramount importance in assuring overall safety and mission success for many years ahead. Clearly, we have choices to make: choices involving the development and application of immature or less-proven frontier technologies, which undoubtedly carry high risks. These are not just the political, societal and human risks discussed earlier, but also involve the intelligent management of technological risks overall. Both the costs and the potential benefits to humanity of the space exploration endeavor will be huge. Therefore, managing technology risks and striking a balance among these intricate yet interrelated cost-benefit tradeoffs has become one of the most difficult challenges for human beings to overcome, especially in the new age of the space movement. Probing such complex issues as technology risk assessment and management for system and human safety on the space frontier requires us to look back and learn from the successes and disasters of past human space activities. The safety management system (SMS) approach is: (a) prevention by design; (b) management of risk; (c) mitigation of consequences. Preventing accidents, reducing disastrous consequences and ensuring human safety under limited resources have always been center stage in space technology development for the manned space flight programs. More often than not, echoes and strong patterns exist among all past human tragedies arising from the development and operation of complex human-machine systems, regardless of whether they occur in the space, transportation, nuclear, defense, marine or petrochemical industries. We know afterwards why we failed. In the technological endeavor of a new frontier, we observe risky outcomes all the time, and the more spectacular or costly ones are heavily reported in the media. The names of some of these are so famous that we need almost nothing more as a descriptor to provoke our thoughts and fears. Just their utterance conjures an instant memory, an image or a pre-conditioned emotional response. Recent major events are startling, quite complex and diverse, and have made headlines and caused huge inquiries. Are all the outcomes (events, disasters and accidents) in some way related? Like all other accidents, could they have been prevented beforehand instead of just understood afterwards? Are they actually just expected events, the standard outcomes in the ongoing technological dance with humans? Should we expect them not to occur in the future? How does the pervasive role of the human affect the outcomes?
On the other hand, are these outcomes just the usual risk of being around today, as depicted by Charles Perrow in his "Normal Accident" theory(1)? Could we manage such technology risk somehow? Is there really an accepted risk? Would a safety management system or organizational mechanism, as suggested by La Porte in his HRO (High Reliability Organizations) theory(2), prevent them? What is the chance of another such event? How can we track these risks so that we can improve safety? What should we expect in the future?

As we carefully observe and examine these accidents and major disasters, in each case we understood the causal factors afterwards. We pointed the fingers of derision and blame, but neglected and forgot the underlying reasons. They were all avoidable, had we been smart enough to see the warning signs, to manage the situation better, and to predict and reduce the risk. Moreover, these accidents happened in, and to, our most advanced technologies, to the finest designs we had produced, in major and highly visible applications, often in full view of modern cameras and news stations. Their names provide strong echoes that still sound in the industries in which they happened: of designs abandoned, staff discredited, managers removed, inquiries conducted, victims mourned, reports produced, and then fading, forgotten except by those impacted or directly affected by the outcomes. Yet they were all preventable: after the fact, we knew or found out what had happened. A combination of circumstances occurs, unique but unforeseen, combined with very human misunderstandings, mistakes and misperceptions, and leads to the observed outcome. There are thousands of everyday auto crashes, fires, floods, ship sinkings, explosions, collisions, oil spills, chemical leaks, medical malpractices, financial losses, train derailments, falls and industrial accidents. Some cause loss of life, some do not. Some make headlines for a day, some do not. But none is predicted beforehand. There are so many that we forget them; they become part of the background noise of our technological society. They are catalogued, recorded, investigated, reported and forgotten. We assign liability, blame and damages, and the older events fade from our lives and the headlines. We move on. We hope that we have learned something from the outcome, and that it "cannot happen again". And if we are thorough or prudent, then it will not, particularly if we prevent whatever it was from mis-operating again in the same way, making a local change so that particular outcome cannot physically happen, or changing out the management and the procedures. Evidently, we cannot focus our resources only on the immediate physical cause of the Columbia accident; we must also rectify the root causes of organizational and human failure in a systematic, integrated and balanced approach. Otherwise, should an accident strike us again, it will likely surprise us with a very different physical cause, perhaps not launch debris falling off the ET (external tank) again, but a cause nonetheless rooted in the same human environment that harbored the making of the previous disaster!

Despite our best efforts, another similar outcome, an event or error, will occur sometime, somewhere, in a random but at the same time systematic way. It will happen because humans and other circumstances will conspire in another unforeseen way to make it happen, with different consequences, in a different place, and at a different time. We observe the historic frequency of such events, but we really need to estimate their future probability of occurrence, the future risk. These echoes of common accident phenomena, the outcomes and their recurrences, raise major questions about our institutions, about our society and our way of handling the use of technology, and about their causes and the responsible and best way forward. Each can, and has, significantly paralyzed programs, as in the case of NASA's SSP (Space Shuttle Program), and severely impacted industries, as in the case of commercial nuclear power, where these accidents caused inquiries, recriminations, costs, litigation and regulation and, most regrettably, the devastation and decline of the pertinent industry in general.

But do these events share a common basis? Should we have expected, or even better, anticipated them? Are we truly learning from our mistakes, from these and all the others that just "happened"? What should or can we do to ensure these crippling losses are not in vain? How can we predict, and thus prevent, the next outcomes? Or should we continue to analyze them as usual, outcome by outcome and event by event, without truly understanding the deep, systemic reasons why we have these echoes of accidents? We must understand that these events (and hence most others) show, demonstrate and contain identical causal factors of human failures and failings and, as a result, the same general and logical development path. It is quite clear that, although the direct physical cause of the Space Shuttle Challenger disaster was different from that of the Columbia accident, the two tragedies are intimately related to one another: the same types of human error and organizational and communication failure existed throughout the history of the SSP, and ultimately sealed the tragic fate of two of the five orbiters ever built. Without elaborating on the needed changes in technical approaches and methodologies for how Space Shuttle safety risk is assessed and managed, one thing must be made absolutely clear: the underlying SoS (system of systems) architecture of the SSP is quite an unhealthy one, if not outright flawed, and we must fix it before another accident strikes us. A simple criterion for assessing whether the risk of a complex technological system is well handled or mismanaged in the course of human space endeavors (regardless of whether its failures lie in design, planning, engineering, operations or program management) is simply to observe the frequency of accidents and their major consequences, and to monitor accident sequence precursors (ASP) and near misses with rigor. A record of two fatal crashes out of a little over one hundred total Shuttle missions is obviously unacceptable not only to our technologists, but to our society and the general public as well. The probability of failure is now known to be about one in fifty for any and all launches. Yes, we may be able to tolerate some level of acceptable risk and take on a considerable dose of inherent technology risk in any space exploration and development activity, but certainly not the risk levels experienced by the SSP: a fatal accident frequency of roughly 1.8E-2 per flight, a whopping six orders of magnitude higher than the combined airline fatal accident risk (about 1.9E-8 per flight) over the entire history of the commercial flight industry, itself a frontier technology barely half a century ago. Part of that difference comes simply from the greater experience (number of flights) that has been accumulated in commercial aviation; but if we must fly a million space missions to attain that same experience level, we certainly cannot have 10,000 more accidents in space while we do so!
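As a back-of-the-envelope check on the figures quoted above, the short sketch below recomputes the Shuttle fatal accident frequency and its ratio to the quoted airline risk; the mission count of 113 is an assumed approximation of "a little over one hundred" flights.

 # Quick check of the quoted risk figures (illustrative, assumed mission count)
 import math

 shuttle_failures = 2          # Challenger and Columbia
 shuttle_missions = 113        # assumed approximate count at the time of writing
 airline_rate = 1.9e-8         # combined fatal accident risk per flight quoted in the text

 shuttle_rate = shuttle_failures / shuttle_missions
 print(f"Shuttle fatal accident frequency ~ {shuttle_rate:.1e} per flight")     # ~1.8e-2
 print(f"Ratio to airline risk ~ {shuttle_rate / airline_rate:.1e}")            # ~1e6
 print(f"Orders of magnitude ~ {math.log10(shuttle_rate / airline_rate):.1f}")  # ~6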

A new frontier in technical risk assessment and management for space activities is understanding the complexity of combinations of events and their physical and phenomenological dynamics in the process of accident occurrence. In simple words, if there is anything we know, it is that we do not really know anything at all. Our conventional understanding of reality is based on approximate physical "laws" that describe how the universe we observe behaves. In particular, the dynamics of the human elements and the complexities of the human mind, when coupled with the complex technological systems created by that same mind, produce both outcomes we expect (results and/or products) and some that we do not (accidents and/or errors). Since one cannot expect to describe exactly all that happens, and since we only understand the causes afterwards, reactively assigning a posteriori frequencies, any ability to proactively predict the probability of outcomes a priori must be based on a testable theory that works. This is true for all the accidents that surround us, because of the overwhelming contribution of human error to accidents and events in modern technological systems. Human error events and failures of human understanding and communication are what cause them; they are the common human elements in the echoes of accidents. But when faced with an error, major or minor, humans always first deny, then blame-shift, before accepting it as their very own. It is a natural survival instinct; it is part of living and of our self-esteem. We do so as individuals, and seemingly also as part of our collective societies. Our mistakes are embedded or intertwined as part of a larger "system", be it a technology, a corporation, a mode of travel, a project decision, a rule or regulation, or an individual action or responsibility. They arise as, and from, an unforeseen combination of events that we only understood afterwards. The consequences of these events can be large or small, but all are part of a larger picture in which we humans are the vital contributor, invariably in some unforeseen way that was only obvious afterwards.
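One common way to move from an observed a posteriori frequency toward a predictive a priori probability, offered here only as an illustration and not as the authors' method, is a simple Bayesian update: a prior estimate of the per-flight failure probability is revised as flight experience and observed failures accumulate. The prior parameters and counts below are assumed purely for illustration.

 # Minimal illustrative sketch (assumed numbers): Bayesian (Beta-Binomial) update
 # of a per-flight failure probability from accumulated flight experience.
 prior_alpha, prior_beta = 0.5, 50.0   # assumed prior belief, centered near 1 in 100
 failures, flights = 2, 113            # e.g., two losses in roughly 113 Shuttle flights

 post_alpha = prior_alpha + failures
 post_beta = prior_beta + (flights - failures)
 predictive_p = post_alpha / (post_alpha + post_beta)   # mean of the posterior distribution
 print(f"Predicted per-flight failure probability ~ {predictive_p:.4f}")  # roughly 1 in 65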

Again, the key issue in space technology safety and mission assurance lies in a thorough understanding of the dynamics of the human elements in the risk and safety paradox. The first problem is the human involvement: subtle, all-pervading, unrecordable and somewhat unpredictable. We cannot creep inside the human brain, the most complex system of all, to see what drives it; instead, we observe externally what happens internally as a result of the stimuli, thinking, actions and decisions made inside the human "machine" as it interacts with the situation and the underlying technology. Conventionally, reliability engineering is associated with the failure rates of components, systems or mechanisms, not of human beings in, and interacting with, a technological system. In other words, it is the interactive complexity between humans and the technological system to which we have yet to pay nearly enough attention, in understanding it and resolving it. Therefore, conventional reliability-engineering-based system safety methodologies fall significantly short in handling the issues of human dynamics, as well as the social environment within which the various human factors exist and fatal errors occur. We now know that this conventional wisdom must change: we must redirect our focus toward the reliability, credibility and stability of the human elements within the larger environment of technological systems. As computer and software control techniques continue to dominate the design and fabrication of modern technological systems, such as the CEV (crew exploration vehicle), NASA's next-generation spacecraft, the human element becomes absolutely crucial to attacking the increasingly complex puzzle of risk and system safety management. All events (accidents, errors and mishaps) are manifested as observed outcomes that may have multiple root causes, event sequences, initiators, system interactions or contributory factors. It is the involvement of humans that causes the outcomes we observe and determines the intervals between them; rarely is it simply a failure of the technology itself. Figure 30.5 provides strong evidence that human-error-induced fatal accidents dominate all other causes in the global airline industry. The Federal Aviation Administration (FAA) has therefore noted:


"We need to change one of the biggest historical characteristics of safety improvements — our reactive nature. We must get in front of accidents, anticipate them and use hard data to detect problems and the disturbing trends".

This is a very clear message to us all, and being proactive requires new theory, innovative methodology and an understanding of the stochastic nature of error causation and prevention. Simply put, we need to act before the next accident strikes us, and we must find accident potentials before they find us! Our fundamental work on the analysis of recorded events in major technological industries shows that, when humans are involved, as they are in most activities of space flight:(3)

  • a bathtub-shaped Universal Learning Curve (ULC) exists, which we must use to track trends (see the sketch below);
  • in technological systems, accidents are primarily random, hence are hard to predict;
  • accidents have a common minimum attainable and irreducible value in their frequency; and
  • apparently disparate events in systems share the common basis of human error.

This is demonstrated again in the case of the two Space Shuttle disasters. Embarrassingly, unexpectedly and in full public view, the US humbled itself and its technology: first by launching on a too-cold day, and then by attempting re-entry with a damaged vehicle, when the risk of debris falling during launch and damaging the orbiter had existed for years. The Columbia inquiry stated that there were "echoes" of the previous Challenger disaster present: the key contributors were the same, namely the safety practices, internal management structures and decisions, institutional barriers and issues, and the lack of learning from previous mistakes and events. Therefore, without innovative safety technologies and comprehensive risk management practices based on an in-depth understanding of the human elements, if we continue to conduct business as usual, identifying problems and making necessary changes only after major accidents, we are doomed to rediscover, event by event, case by case, inquiry by inquiry, accident by accident, error by error, tragedy by tragedy, the same litany of causative factors, the same mistakes, the same errors.
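Returning to the first point in the list above, the learning-curve trend can be made concrete with a small numerical sketch. The exponential form below is an assumed illustrative parameterization in the spirit of a learning curve that declines toward an irreducible minimum; it is not the authors' exact model or fitted data.

 # Illustrative learning-curve sketch (assumed form and parameter values):
 # the outcome rate falls with accumulated experience toward an irreducible minimum.
 import math

 def outcome_rate(experience, initial_rate, minimum_rate, k):
     """Rate per unit of experience after 'experience' units have been accumulated."""
     return minimum_rate + (initial_rate - minimum_rate) * math.exp(-k * experience)

 # Assumed example: rate per flight versus millions of accumulated flights
 for accumulated in (0.1, 1.0, 5.0, 20.0):
     r = outcome_rate(accumulated, initial_rate=1e-5, minimum_rate=2e-7, k=0.3)
     print(f"{accumulated:5.1f} million flights -> rate ~ {r:.2e} per flight")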


Figure 30.5. World-wide fatal accident causes by category (see color section).

Perhaps not in the same order, or with the same importance, or of the same type, but the same nevertheless. So the preventable becomes the unforeseen, the avoidable becomes the inevitable, and the unthinkable becomes the observed, time and time again. We may call this deadly cycle of reactively and passively dealing with technology risks the "death trap paradigm" (DTP) of system safety and risk management, and we need to do everything necessary to avoid stepping into it.


Managing Programmatic Risk in Strategic Decisions

The Space Shuttle is considered by many to be the most complex space vehicle ever designed, operated and managed by NASA. Yet the Shuttle will be retired soon, to be replaced by even more complex space vehicles and mission systems: NASA's ambitious crew exploration vehicle (CEV) will be the most complex the world has ever seen. The issue of enterprise risk management has thus become increasingly important, and the focus is now on programmatic risk management for strategic resources, and for adequate control of cost and schedule, on long-term, complex, highly sensitive, or capital-intensive national and international programs. The key to the enduring success of large and complex space programs, like NASA's Constellation program (manned lunar missions and beyond), lies not only in the sound management of technology risks, as previously discussed, but in the appropriate assessment and optimal management of programmatic risk as well. It is therefore equally important to utilize Programmatic Risk Assessment and Management (PgRAM) techniques and tools to assess risks and their tradeoffs among all available alternatives when large program decisions of high strategic and national importance are at stake.

The history of human space programs might have taught us enough lessons in this regard, and might have reminded us to ask a harsh question: what would US space frontier activities look like now had we paid enough attention to this very issue of risk tradeoffs some 35 years ago? It is not so hard to comprehend today that, if we had adequately utilized a scientific methodology like PgRAM back in the late 1960s, when serious national and strategic decisions had to be made on the fate of the Apollo program in favor of the Space Shuttle program, we could perhaps have avoided a whole spectrum of serious enterprise problems.

Figure 30.6. A View of a Risk Scenario in Long Term Program Planning

These include major setbacks later in the Space Shuttle program, such as severe cost overruns, missed design objectives, the safety and financial disasters, the prolonged unavailability of American access to the ISS (International Space Station) and, most of all, the loss of time and the waste of national resources, including the loss of the most invaluable asset: program memory, the human and corporate capital of expertise and experience. Figure 30.6 shows a simplified view of risks in long-term program planning. As indicated in this simple process for laying out risk areas across various aspects of program assessment, benefit analysis and environmental impacts are important elements to consider for long-term program risks. Again, had NASA done enough serious homework in this area over the past few decades, we could perhaps have avoided several expensive investments in space vehicle programs that were either short-lived or wasted, like the X-33, X-34 and X-38 projects. We might also have avoided wasting money and time switching among launch vehicle R&D efforts such as the second-generation SLI (Space Launch Initiative), the CTV (crew transfer vehicle) and the OSP (orbital space plane), even though they all appeared attractive at the time. As systems engineering and project management take on the challenge of fulfilling project and mission requirements, areas of high programmatic risk will become the focus of program management attention, to ensure appropriate visibility and effectiveness in how resources are allocated. The major objective is to eliminate, as early as possible, those project uncertainties that can result in unexpected growth in cost and schedule. We must distinguish programmatic risk, which is closely associated with a project's budget, schedule and performance requirements; it should not be confused with the technological risks to the safety of workers, the general public and engineered systems within the entire technology environment. As an example, take the enterprise risk management of NASA's national Space Shuttle Program (SSP) coupled with the international Space Station program (ISS). A set of high-risk programmatic scenarios could have been identified, analyzed and well anticipated, with optimal and well-prepared strategic responses, back in the early days when strategic program decisions and the national and international enterprise architectures were made and determined. At the very least, if we had made more intelligent program decisions at the enterprise level with the help of such technologies as probabilistic risk assessment (PRA) and PgRAM, we could have avoided: i) the very wasteful premature shutdown of the entire Space Shuttle program; ii) the most difficult situation of having to buy American access to LEO (low Earth orbit) in order to complete ISS assembly; and iii) the forced cutbacks from the intended scope of the ISS design. These all caused a loss of credibility in international partnerships and an unexpected waste of societal investment, in the form of a curtailed ISS program with reduced scientific mission value.
Most importantly, we could have accumulated highly desirable design and operating experience over many years, together with much-needed technical data on the space vehicle systems, pertinent not only to our present needs in the design and operation of long-duration deep-space crew exploration vehicles; we could perhaps also have made the Space Shuttle vehicle system the most technologically and economically viable candidate to carry forward the continued course of human exploration of our solar system, ferrying mankind to Mars and far beyond!

The programmatic risk approach was primarily derived from the PRA techniques used for system safety and mission assurance studies. It was developed particularly in response to the failures of traditional analysis methods (like FMEA and QMS) in the planning and undertaking of complex projects and programs. In the past, poor projections of future program performance were not caused by lack of care, but were the result of unrealistic analysis assumptions and the ad hoc treatment of uncertainties in conventional techniques. A key contributor is that the projections are prepared by the performer, often a Program Office with a substantial bias toward showing program success. Achievement of programmatic milestones then takes precedence over strategic program effectiveness. Therefore, it is extremely important that, during the process of making strategic program decisions, an external, "out of the box", independent viewpoint be an essential component of the PgRAM process. This oversight must not just be advisory: it must set directions, make effective recommendations and take efficient actions. Rigorous quantitative risk assessment and management processes at the enterprise level can truly provide an analytical framework for assessing and applying scarce resources to the management, resolution, elimination and disposition of the highest-priority program risks. They aid in identifying potential risks and quantify the impact of each on a program or an organization within the underlying enterprise. The process also aids programs and organizations in balancing their competing risks and optimizing the use of scarce budgetary or human capital resources, in order to eliminate much of the uncertainty from the vital decision-making process. The history of past high-value programs has repeatedly shown that performing enterprise and programmatic risk assessment and management is not only necessary for strategic decision-making; it is also an ideal vehicle for system modeling of real-world problems, by systematically incorporating the uncertainty in each of the subtle dependencies among these areas into a formal model. The programmatic risk assessment approach differs significantly from conventional deterministic analysis in its explicit inclusion of uncertainties in the quantification of predicted program outcomes. Decision-makers have always recognized the existence of uncertainty in the information available to them, and in the past have had to learn to deal with that uncertainty intuitively. However, a sound PgRAM process significantly improves the decision-makers' state of knowledge and provides a more objective basis for decisions. We may also deal with unknown risks (the "unknown unknowns") using subjective probabilities that can bound the outcomes and enable relative judgments of risk importance. The most important aspects of programmatic risk management are its ability to:

  • account for various kinds of enterprise or program activity constraints, such as intra-activity and inter-activity dependencies (much like the intricate dependencies between projects across NASA's SSP and ISS programs);
  • account for the effects of external conditions (e.g., the impact of government regulations, international treaties and the political environment, much like the multi-national dependencies of projects within the ISS international partnership framework); and
  • account for uncertainty in the input parameters and in the set of possible complications that may occur (a minimal sketch of such an uncertainty treatment follows below).
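As an illustration of the last point, the explicit treatment of uncertainty, the sketch below runs a small Monte Carlo simulation of a chain of dependent program activities with assumed triangular duration estimates and one possible external complication. It is a generic illustration under stated assumptions, not a NASA or PgRAM tool.

 # Minimal sketch (assumed inputs): Monte Carlo treatment of schedule uncertainty
 # for a few dependent program activities, compared with a deterministic estimate.
 import random

 def sample_duration(low, mode, high):
     """Triangular distribution: a common way to encode expert schedule estimates."""
     return random.triangular(low, high, mode)

 def one_program_realization():
     design = sample_duration(10, 12, 20)   # months (assumed values)
     build  = sample_duration(18, 24, 40)
     test   = sample_duration(6, 8, 16)
     delay  = 6 if random.random() < 0.15 else 0   # possible external complication
     return design + build + test + delay          # activities run in series

 trials = sorted(one_program_realization() for _ in range(20000))
 deterministic = 12 + 24 + 8                       # sum of the "most likely" values
 print(f"Deterministic estimate:  {deterministic} months")
 print(f"Mean with uncertainty:   {sum(trials) / len(trials):.1f} months")
 print(f"80th percentile:         {trials[int(0.8 * len(trials))]:.1f} months")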

Such is the case now that the Russians are asking NASA to pay for rides to the ISS, because no other options are left to the US. We should have assessed the risk scenario of having no option but to pay for access to LEO should the continued operation of the Space Shuttle become an issue. The result might have served the interests of the US and of all the other ISS partners better by including in the joint multi-national ISS enterprise another international partner possessing the capability of manned access to the ISS at much lower cost. In short, sound programmatic risk management unveils the true risk of programs so that decision-makers can make informed decisions about potential alternative courses of action. Figure 30.7 provides a typical flowchart of risk assessment for a strategic program or enterprise decision-making process. Details of this technique will be discussed in a forthcoming book.

Figure 30.7. A Schematic of Programmatic Risk Assessment Approach

What Does It Take to Achieve Safety and Mission Success?

With the rapid development and advancement of technologies in recent years in the system safety and assurance engineering field, led largely by the commercial nuclear power industry(4~6), mankind has achieved a great deal in assuring safety and system reliability in large, complex technological systems and space program operations, despite the setbacks and major accidents. The last few decades in particular have seen increasingly widespread use of PRA and safety management techniques (SMS) as aids in making complex technical and programmatic decisions on major space initiatives. NASA's current Constellation program, with its rigorous assessment of the CEV and CLV system architectures and design concepts with respect to crew safety and vehicle reliability, is a culmination of such efforts in the right direction.

However, despite the progress that has been made in risk science and safety technology, many unresolved issues remain. There are still numerous examples of risk-based decisions and conclusions that have caused great controversy. In particular, there is a great deal of debate in the space community surrounding the benefits of PRA and risk management in the decision-making process; the role of values, ethics and other socio-political factors in safety and risk; the efficacy of quantitative versus qualitative analysis; and the role of uncertainty and incomplete information.(7~8) In addition, these debates become even more controversial around the question of how we look ahead and move forward on the tough challenges of safety assurance and mission success in the new age of commercial space flight. What are the most promising solutions for system safety? Is it merely a technical issue that relies only on safety technologies, optimal design techniques, and operational and organizational approaches? Or are there political or organizational solutions that could help achieve a maximum safety goal while still keeping commercial space travel affordable to the general public? How do we regulate the imminent boom of the coming space flight industry, the emergence of space entrepreneurship and the space technology revolution? Is there truly an acceptable risk metric for human safety in commercial space travel or in manned exploration activities? While some insights that bear on these questions have been discussed in previous sections, it is the intent of the authors, in the remainder of this chapter, to summarize and share some intriguing thoughts that may shed light on these issues of utmost importance in our time.


We summarize below the five major areas to which mankind must pay a great deal of attention in the years ahead in order to unravel the paradox of human safety and risk taking, to achieve the safe and affordable undertaking of human space activities, and to make commercial space travel a safe, near-future reality for us all.


  • the compounding factors of human dynamics — a centerpiece in safety
  • system-based concept and systems architecture analysis — the foundation
  • smart decisions with an Integrated Risk Management framework
  • design-based safety philosophy — reduce and eliminate accidents by design
  • gaining insights by continuous R&D in accident theory development


These five aspects, of high technical and programmatic importance to safety and assurance engineering activities, are depicted in Figure 30.8 as "the 5-block foundation," which serves as an R&D reference and principal activity guide for safety and mission assurance (SMA) programs. We now elaborate briefly on each of the five areas to explain the main issues of concern that must be addressed to achieve safety and mission success in any complex socio-technological system in space programs, as well as in capital-intensive engineering systems with high human safety risks in other industries. A detailed discussion of the "5-block SMA foundation" will be provided in a separate chapter.

Figure 30.8. The 5-block foundation in Safety & Mission Assurance Activities


The Factors of Human Dynamics — A Centerpiece in Safety Assurance

Complex and compounding human behaviors and factors are the dynamic elements in space programs, and their assessment must always serve as the centerpiece of assurance engineering activities for human safety. Although notable progress has been made in this area, more research and development (R&D) effort is needed to meet the new challenges of today's complex, semi-automated homo-technological systems, and it must be focused on the very dynamics of the human element within the safety and risk equation. Through these efforts we need to fully understand the profound impact of human factors and of socio-political and organizational influences on human safety, risk and risk perception within the context of human systems engineering. Adequate adaptation and implementation of safety requirements, standards and organizational policies focused on the human aspects, rather than on just the mechanical systems, under an appropriate and enforceable regulatory framework, are of utmost importance to our future success.


System-based Concept — Architecture in Systems Analysis

In the past, little or no attention was generally devoted to applying a system-based view or system-of-systems (SoS) architecture in safety and risk assessment and management activities. It is therefore crucial that we bring systems theory and systems analysis tools and methodology to safety and mission assurance assessment, rather than focusing resources only on isolated reliability requirements or on hazard assessments aimed solely at critical components and systems. Using concepts from systems theory, one can approach risk assessment in a way that begins with system identification and an understanding of the system architecture, a step that must be carried out by risk managers and program decision-makers.

Smart Decisions by Integrated Risk Management — A "Double-T" Concept

New challenges in SMA activity require a stronger focus on the use of innovative risk management processes and practices. In the aftermath of the Challenger accident in the mid-1980s, and especially after the recent loss of Columbia, space agencies such as NASA greatly expanded their efforts to apply quantitative risk techniques such as PRA. The present Space Shuttle PRA is one culmination of these efforts(9~10), although it simply highlights the presently known, unacceptably high levels of risk. Nonetheless, as pointed out by the CAIB report(11~12), a critical weak link exposed in NASA's safety and risk management practice has been the lack of a comprehensive, integrated SMA methodology. In an attempt to address these broad SMA issues from the technical risk management perspective, an SMA management framework based on an integrated risk assessment process has been developed. The centerpiece of this methodology, as illustrated in Figure 30.9, is the "Triple-triplet" concept of an integrated risk management process. It extends the risk assessment triplet(7) to include and combine all aspects of the SMA elements within a systems engineering framework. With the Triple-triplet (Double-T) integrated risk management framework, the best risk tradeoff decisions can be achieved with risk insights not only from PRA but also from engineering and system safety activities, all integrated into a single, complete, system-based process. The Double-T conceptual framework for integrated risk management is only briefly illustrated here, with the expectation that interested readers will pursue the technical details in the cited reference.(13) A minimal illustrative sketch of how the nine triplet questions can be carried as a simple checklist follows the list of questions below.

Figure 30.9. A Triple-triplet (TT) Conceptual Framework for Integrated SMA

Risk Assessment Triplet — Gaining PRA Insights:

  • (1) What can go wrong?
  • (2) What's the likelihood?
  • (3) What are the consequences?

System Safety Triplet — Gaining Engineering Insights:

  • (1) What are the hazards?
  • (2) What's the requirement?
  • (3) What's the compliance?

Risk Tradeoff Triplet — Making Tradeoff Decisions:

  • (1) What's going on?
  • (2) What can be done?
  • (3) What's the impact?
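As a minimal sketch only (the record structure and example entries below are our own illustration, not part of the cited framework documents), the nine triplet questions can be carried through an assessment as a simple structured checklist so that no leg of the process is silently skipped:

    # A minimal, illustrative record of the nine Triple-triplet questions.
    # The data structure and the example entries are hypothetical.

    from dataclasses import dataclass

    @dataclass
    class TripleTriplet:
        # Risk assessment triplet (PRA insights)
        what_can_go_wrong: str = ""
        likelihood: str = ""
        consequences: str = ""
        # System safety triplet (engineering insights)
        hazards: str = ""
        requirement: str = ""
        compliance: str = ""
        # Risk tradeoff triplet (decision insights)
        whats_going_on: str = ""
        what_can_be_done: str = ""
        impact: str = ""

    record = TripleTriplet(
        what_can_go_wrong="ascent debris strike on the thermal protection system",
        likelihood="order 1e-2 per flight (assumed for illustration)",
        consequences="loss of vehicle and crew on re-entry",
    )
    print(record)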


Design-based Safety Philosophy — Eliminating Accidents by Design

The best engineering effort for system safety starts with a superior design philosophy, or design concept selection process. Using both deterministic and probabilistic safety analysis tools for system architecture and concept studies in the early design stages is critical to system safety assurance. Clearly, if a design is pregnant with the potential for disaster (a latent error, for example), the chances are that an accident will happen sooner or later, regardless of the best engineering efforts later in the program. One such case is the launch debris-shedding problem rooted in the external tank design, which eventually caused the loss of the Space Shuttle Columbia. Another interesting example also occurs in the Space Shuttle design, where the concept of combining a cargo ship and a crew vehicle into one tightly coupled, gigantic space transport truck is highly controversial with respect to safety risk versus cost-benefit. This likely flaw in concept selection philosophy may have been a contributing factor in the doomed fate of the entire program. Another design consideration has to do with variations and unknowns, and the extent to which they are taken into account. Should the design be deterministic, or probabilistic, or both? Criteria are also fundamental to design, as is the philosophy implied within the paradox of safety and risk, or of cost and reliability. One of the critical design philosophies concerns the essential role that failures play, thus leading us down the Universal Learning Curve. Petroski(14) has expounded this philosophy, stating that the design process must consider in depth all potential failure modes. This is true even when the new design is a moderate scaling-up of a currently successful system, where the slightest design change can introduce effects that lead to totally unexpected catastrophe. In recent years, the use of PRA in design analysis for space programs has gained widespread acceptance because its probabilistic treatment of uncertainty is more robust than that of the deterministic safety-factor techniques, such as worst-case loading and 3 (or more!) sigma safety margin assignments, that have always been used in the past (a small numerical sketch of this comparison follows the list of principles below). To help strike the balance between the desiderata and the actual design, the ten most meaningful design philosophies or principles for accident avoidance and system safety are summarized below to share with our readers:

  • (1) Inherent simplicity and flexibility in system design often imply safety and lowest cost, and are perhaps the best design approach for safety, cost effectiveness and mission success.
  • (2) Prevent accidents by defense-in-depth system design, such as the use of multi-phase and multi-layered protection mechanisms and structures whenever possible.
  • (3) Preclude high consequences due to human error, and reduce the potential for any human-induced fatal event, by design, through a thorough understanding of the dynamics of the human element and human factors and the use of sound design practices.
  • (4) Make extensive use of a "fail safe" and/or "fail operational" philosophy to guard against fatal consequences in case of total system failure.
  • (5) Mitigate consequences by diversity (addressing common cause failures, CCF(15)), and avoid tight coupling of interface designs; use a loosely coupled, decoupled and modularized design philosophy to isolate and contain accident propagation whenever possible.
  • (6) Use redundancy, and always minimize potential critical single points of failure, whenever necessary to guard against random failures or failures due to wear, fatigue or aging, but only if common cause failure is an unlikely threat.
  • (7) Use a passive-safety design concept for the total elimination of accident or severe consequence potentials, such as advanced nuclear reactor designs that preclude the possibility of core melt in case of total system failure.
  • (8) Use a probabilistic design philosophy for concept scrutiny and for setting safety targets, including the treatment of uncertainties whenever necessary, as the best design practice for eliminating unknowns and minimizing the number of uncertain variables.
  • (9) Take a system-based, systems engineering view (as opposed to a component-based view) in design concept assessment, selecting and analyzing integrated designs within a system-of-systems (SoS) framework to avoid designs resulting from an isolated view focused only on the technology side.
  • (10) Define the boundaries and cliff edges of the design as far as possible. Always believe in the philosophy of robust design, and choose conservatism (additional safety margins, features or systems) if your only other option is staying near or on the edge.
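To illustrate principle (8), the following sketch contrasts a deterministic 3-sigma worst-case check with a simple Monte Carlo estimate of failure probability. The load and strength distributions and their parameters are hypothetical assumptions, not flight data.

    # Deterministic 3-sigma margin check versus a probabilistic estimate.
    # All distributions and parameter values are hypothetical.

    import random

    random.seed(1)
    N = 200_000
    mu_load, sigma_load = 100.0, 10.0          # applied load (arbitrary units)
    mu_strength, sigma_strength = 160.0, 15.0  # structural capability

    # Deterministic check: worst-case load (mean + 3 sigma) against mean strength.
    worst_case_load = mu_load + 3 * sigma_load
    print(f"3-sigma worst-case load {worst_case_load} vs. mean strength {mu_strength}")

    # Probabilistic check: estimate P(load exceeds strength) directly.
    failures = sum(
        1 for _ in range(N)
        if random.gauss(mu_load, sigma_load) > random.gauss(mu_strength, sigma_strength)
    )
    print(f"estimated failure probability ~ {failures / N:.1e}")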

Probabilistically, we know that we may design systems to 10E-3 to 10E-6 failures per demand (reliability) by using redundancy, diversity and the simplicity of inherent safety features. The dominant failure modes then always become the following (a short numerical sketch follows the two items below):

a) Common mode failures — a unifying failure inherent in the design that disables many systems at once and hence can fail the whole, such as a fuel tank seal failure, power system failure, thermal tile damage or an oxygen atmosphere explosion

b) Homo-technological error — a failure of the system design that allows humans to bypass the safety design intent by avoiding, allowing, overriding, ignoring, denying, misusing, misunderstanding or misinterpreting it, such as using incorrect avoidance procedures in mid-air collisions, or not knowing about the effects of tile damage on re-entry.
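The arithmetic behind this observation is simple. The sketch below uses the standard beta-factor treatment of common cause failure with hypothetical numbers; it shows why the common-mode term quickly dominates once independent failures have been driven down by redundancy.

    # Why common-cause failure dominates a redundant design (hypothetical numbers).
    # Beta-factor model: a fraction beta of each channel's failure probability is
    # a shared failure that defeats all channels at once.

    p_channel = 1e-3   # per-demand failure probability of one channel (assumed)
    beta = 0.05        # assumed common-cause fraction
    n_channels = 3     # triple redundancy

    p_independent = ((1 - beta) * p_channel) ** n_channels  # all channels fail independently
    p_common = beta * p_channel                             # one shared failure defeats them all
    p_system = p_independent + p_common

    print(f"independent contribution : {p_independent:.2e}")
    print(f"common-cause contribution: {p_common:.2e}")
    print(f"total per-demand failure : {p_system:.2e}")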

Gaining Insights by Continuous R&D in Accident Theories

One of the most important aspects of assuring safety and preventing catastrophic events in space, or in any other high-risk program, is to understand and learn the theoretical basis of, and insights into, why accidents occur and how they propagate. There have been long debates within the safety science and technology community on (a) why catastrophic events keep happening regardless of how rigorously human beings have tried to prevent them, and (b) how much engineering effort and human resource must be allocated to such prevention. Yet a great deal of confusion on this very issue still exists among the managers and decision-makers of many high-value engineering communities. For instance, a considerable number of engineers and managers within the NASA community still puzzle over why rare events happen even with low likelihood (for the answer, see Figure 30.2), and why Space Shuttle orbiter safety continues to tumble when extraordinary amounts of money and time have already been spent on FMEAs, hazard reviews, hardware tests, and risk and reliability analyses. Undoubtedly, only with a thorough understanding of the fundamental theoretical basis of how fatal accidents come about can we gain critical insights from these theories, and effectively prevent and even predict future events by developing the right techniques and making the right safety and risk decisions.
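One quantitative way to see why "rare" events nonetheless keep happening is to accumulate a small per-mission probability over many missions. The per-mission values in the sketch below are illustrative assumptions only.

    # Probability of at least one catastrophic loss over a campaign of missions.
    # Per-mission loss probabilities are illustrative assumptions.

    for p_per_mission in (1e-2, 1e-3, 1e-4):
        for n_missions in (10, 100, 1000):
            p_at_least_one = 1 - (1 - p_per_mission) ** n_missions
            print(f"p = {p_per_mission:.0e}, missions = {n_missions:4d} -> "
                  f"P(at least one loss) = {p_at_least_one:.3f}")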

One of the key contributions made by Perrow in his Normal Accident Theory(1) (NAT) was the identification of risk-contributing system characteristics, which carry significant implications for how we ought to design and operate complex socio-technical systems. NAT holds that the combination of high complexity and tight coupling must lead to failures. However, the notion implied in this theory, that we cannot assure the safety of complex technical system designs, is not only a pessimistic proposition but quite disturbing to design engineers and to those who operate and manage such systems. (The bathtub model and the Learning Hypothesis offer a way out of this conundrum.) A major flaw in the NAT argument, however, is that the only engineering solution it considers for improving safety is the use of redundancy, despite the paradox that redundancy introduces additional complexity and hence higher safety risk. It also ignores the human contribution, which is unchanging. Others have argued that the safety of every engineering design is a hypothesis that can only be tested during commissioning and/or operations. The difficulty is that not all complex systems can be fully tested; consider spacecraft and weightlessness. The Apollo 13 near-fatal accident might have been averted had anyone tested what would happen if two oxygen tanks were lost; no one believed that a thermal switch failure inside the O2 tank, induced by a pre-flight ground test, was even a remote possibility.

In spite of the pessimistic views contained in NAT, can we somehow organize ourselves effectively enough to avoid accidents? Another school of accident theorists gives a very positive answer to this question, represented by K. Roberts' work(2) and known as High Reliability Organization (HRO) theory, which argues that we can achieve reliability with appropriate attention to organizational management. According to HRO theory, it is not at all clear that all high-risk technologies must fail. The Human Bathtub in Figure 30.10 explains this difference clearly: failure is a probability that varies with experience, and hence we can learn and reduce the rate faster by having a learning organization, which is then labeled an HRO. In contrast to NAT, HRO theory counters Perrow's hypothesis by suggesting that some interactively complex and tightly coupled systems operate with very few accidents under highly organized, "high reliability" institutions. This is consistent with the presence of a Universal Learning Curve. A much-noted problem with this theory is that the systems studied by HRO researchers are neither interactively complex nor tightly coupled. Furthermore, HRO theory implies that, so long as high reliability is achieved in every aspect of a well-managed organization, it will no longer have accidents and safety is therefore assured. The problem here is clearly confusion between reliability and safety: high reliability does not necessarily mean safety, an accident can still happen in a perfectly reliable system, and highly safe systems are not necessarily reliable. In the 1999 loss of JPL's Mars Polar Lander, everything performed reliably as designed, but the Lander crashed into Mars during entry simply because of a design flaw: the designers failed to account for all the complex interactions between leg deployment and the control software.
Another example is the occasional need for system operators, say an airline pilot, to break the rules of predefined flight procedures in order to prevent an accident, for instance averting a fatal collision during landing by diverting the plane to a different runway or a nearby field. Conversely, in the mid-air collision over Switzerland, the (human) ground controller instructed one of the planes onto a collision course, and the instruction was followed despite warnings to the aircrew from the collision avoidance system. Again, if the three Apollo 13 astronauts had "reliably" followed all the crew procedures documented and trained before the mission, they would never have had any chance of surviving. We must distinguish between reliability, safety and risk, which are different qualities and should not be confused. In fact, these qualities often conflict: increasing reliability may decrease safety or incur more system risk, and likewise increasing safety may decrease reliability. A significant engineering challenge is to find ways to increase safety without decreasing reliability. We have seen managers and lead engineers of high-risk systems so focused in their daily effort on reliability estimates and predictions that they forget to ask "what is going on" by looking at the "big-picture" condition of their systems, a critical insight that is one of the "triplet" elements introduced within the "Double-T" concept.

Figure 30.10. The Human Bathtub — Probability of an Organizational Failure
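The learning-curve idea behind the Human Bathtub can be sketched under the simple assumption that the failure (error) rate declines exponentially with accumulated experience toward an irreducible minimum. The functional form and all parameter values below are assumptions chosen for illustration, not fitted data.

    # Minimal learning-curve sketch: failure rate falls with accumulated
    # experience toward an irreducible minimum.  All values are hypothetical.

    import math

    lambda_0 = 1e-2     # initial failure rate (assumed)
    lambda_min = 1e-5   # irreducible minimum rate (assumed)
    k = 0.05            # learning-rate constant (assumed)

    def failure_rate(experience):
        """Failure rate after a given amount of accumulated experience."""
        return lambda_min + (lambda_0 - lambda_min) * math.exp(-k * experience)

    for eps in (0, 10, 50, 100, 200):
        print(f"experience = {eps:3d} -> rate ~ {failure_rate(eps):.2e}")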

Obviously, although both NAT and HRO theory provide very important insights into how we should deal with high-risk technical systems in order to achieve safety, each fails to offer solutions that would complement the other's shortfalls. Unarguably, complex socio-technological systems not only need more sophisticated accident theories and approaches to resolve safety and reliability issues; they also demand continued R&D effort on the theoretical frontier, especially for safety challenges involving the trans-scientific, non-random, technical and organizational factors in accidents. Recently, there have been strong advocates of a system-based approach to safety(16), in which a top-down systems view is proposed in place of the bottom-up, reliability-engineering-focused standard engineering approaches. In such accident theory, the traditional conception of accidents based on an event sequence, or chain of events, of directly involved failures and human errors is abandoned. Chain-of-events models encourage limited notions of linear causality and cannot easily account for non-coherent system events, such as the indirect, nonlinear and feedback interactions common in accidents in the complex and dynamic systems often seen in the space industry. A systems view of accident causation allows more complex relationships between events and also provides a way to look more deeply at why the events occurred.

One of the weak points of the systems approach to safety, as advocated lately(17), is that it emphasizes only the top-down systems view of understanding and modeling accidents, which is mainly based on the deductive logic process that gives us a big-picture view within a broader system context. It ignores or underestimates the effectiveness of, and useful insights offered by, the bottom-up, inductive logic processes of accident modeling practice, such as the FMEAs, CILs and event trees that have long been widely used in the aerospace community. So once again a balance is required. In particular, the identification of critical (single) failure points at NASA has predominantly been accomplished by this labor-intensive, detailed, engineering-based qualitative safety risk management infrastructure, and it is considered effective by many NASA engineers because it served so well during the Apollo and early Space Shuttle era. As technologists experienced on both fronts of theory and modeling practice in the safety risk field, we have found that, although we are strong advocates of systems-based theory and approaches, applying inductive modeling processes can also help identify major accident scenarios, allowing us to better understand the inductive nature of accident sequences. In light of these issues, and of others raised by the NAT and HRO accident theories, an alternative approach has been introduced that takes the systems-based view as its foundation but makes use of combined risk insights from both top-down and bottom-up approaches for understanding and modeling complex accident processes.(13) This System-based Dual-Process (SDP) approach to integrated risk modeling stems from the "Triple-triplet" (Double-T) concept discussed briefly in the previous section, and the details of its theoretical basis are beyond the scope of this chapter. Interested readers should pursue further details in the references.(10,13) The significant benefits of the SDP approach to safety, however, result from the combined insights gained from other accident theories such as NAT, HRO and the general systems approach. The key elements of the SDP approach to safety and accident modeling can be summarized in the following four aspects:

  • Adopt a systems-based view of accidents that focuses on the socio-homo-technological system as a whole, and understand and manage the interactive complexity by modeling the relationships between technical, organizational, human and socio-political aspects.
  • Understand and model accidents with "dual-process" logical thinking: top-down deductive logic for the overall system architecture and performance, combined with complementary bottom-up inductive logic, which is scenario-based and focuses on accident scenario identification and reliability engineering aspects (a small numerical sketch of this dual evaluation follows the list).
  • Emphasize the dynamics of the human aspects in managing uncertainties in the safety and reliability organizational infrastructure, taking full advantage of the insights offered by HRO theory and fully recognizing that HRO does provide insights for improving human and system reliability.
  • Apply a system-based Integrated Risk Management Framework built on the Triple-triplet concept, and manage total safety risk by making decisions that integrate all phases of the system life cycle: from design and concept selection using NAT risk insights, to engineering, operation and program management.
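To give a flavour of what combining the two logical directions can look like in practice, the toy system below (entirely hypothetical, with assumed failure probabilities) is evaluated top-down through a small fault-tree structure and bottom-up from its minimal cut sets; the two views should, and do, agree.

    # Illustrative "dual-process" evaluation of a toy system; all values assumed.
    # Top event: loss of cooling = (pump A fails AND pump B fails) OR power fails.

    p = {"pump_A": 1e-2, "pump_B": 1e-2, "power": 1e-4}  # assumed probabilities

    # Top-down (deductive): evaluate the gate structure directly.
    p_and = p["pump_A"] * p["pump_B"]
    p_top_down = 1 - (1 - p_and) * (1 - p["power"])

    # Bottom-up (inductive, FMEA-like): enumerate the minimal cut sets.
    cut_sets = [("pump_A", "pump_B"), ("power",)]
    p_no_cut_set = 1.0
    for cs in cut_sets:
        p_cs = 1.0
        for component in cs:
            p_cs *= p[component]
        p_no_cut_set *= (1 - p_cs)
    p_bottom_up = 1 - p_no_cut_set

    print(f"top-down fault-tree estimate: {p_top_down:.3e}")
    print(f"bottom-up cut-set estimate  : {p_bottom_up:.3e}")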

Looking Ahead

History has brought mankind to the brink of an unprecedented era of opportunity for science and technology advancement on the space frontier. With the continued escalation of mankind's quest for the unknown, it is inevitable, led by ambitious US space endeavors, that a whole new array of worldwide space science, commercial space flight industries and space commerce is emerging on the horizon. The road ahead is, of course, full of uncertainty and risk: uncertainties that may lie well beyond human comprehension, and risks that will almost certainly cost enormous resources, and perhaps human lives as well. Nevertheless, we cannot be afraid of moving forward in the face of risk. Safety risk can never be zero: mankind will never migrate beyond Earth if we fail to take risks on the space frontier, paying the probable price for what we gain. This is simply Darwin's law of nature, and wishing to benefit in space without human sacrifice is simply against the law of nature. We cannot imagine an answer otherwise! Human civilization as we know it today would not exist if we were afraid of risk taking. Taking risk is part of the human spirit, a key attribute of mankind built into our genetic propensity since our very existence.

To answer the question: can commercial space travel be made safe yet affordable for us all? If we move one million people each year to another planet, or between continents through LEO, and the desire is for no accidents, the risk goal is much less than 10E-6 per year; with, say, 100 passengers on each flight, this is a goal of less than 10E-4 per flight. This target has been achieved by commercial flights today, but not yet by present launch rockets and shuttles, which are still at the ~10E-2 level per launch. A reduction of two orders of magnitude in failure rates is therefore required to achieve the same comparative risk level. Can we make space exploration safe? Yes, absolutely, but this does not mean that accidents will never happen again. Taking risk may come with high returns, yet it always implies possible adverse consequences. We still need technological advances or breakthroughs, which themselves carry risk. The future benefit of today's human space endeavors will never be fully understood or anticipated; it may be beyond the wildest imagination of which mankind is capable at the current stage of civilization.
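The arithmetic behind these targets can be restated in a few lines. The passenger numbers are those assumed in the text, the "current" figure is the roughly 10E-2 per-launch loss rate quoted above, and the target is phrased here as keeping the expected number of losses below one per year.

    # Back-of-envelope flight-rate arithmetic for large-scale space travel.
    # Inputs follow the assumptions stated in the text.

    import math

    passengers_per_year = 1_000_000
    passengers_per_flight = 100
    flights_per_year = passengers_per_year // passengers_per_flight   # 10,000 flights

    # For fewer than one expected loss per year, the per-flight loss probability
    # must be below 1 / flights_per_year.
    per_flight_goal = 1.0 / flights_per_year                          # 1e-4
    current_per_launch = 1e-2                                         # ~today's launchers

    gap = math.log10(current_per_launch / per_flight_goal)            # 2 orders of magnitude
    print(f"flights per year       : {flights_per_year}")
    print(f"per-flight goal        : < {per_flight_goal:.0e}")
    print(f"reduction still needed : ~{gap:.0f} orders of magnitude")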

The key is how we manage risk and deal with uncertainty. We must realize that managing and taking risk is a major part of our daily activities on the space frontier. The right question is how we can take smart risks, and take them most intelligently. The most important issue here is uncertainty: technical, organizational and social. It is uncertainty that makes engineering difficult, challenging and sometimes unsuccessful. Deciding which outstanding problem ought to be given priority is a difficult problem in itself. Also, because many high-tech systems use new technologies, understanding of the physical and social phenomena that may cause problems is often limited. Space systems, and most high-tech systems, carry unresolved technical uncertainty. If it were necessary to resolve all uncertainty before the use or operation of any spacecraft system, as politically perceived by many within the executive circle, most high-risk space systems would need to be shut down and important functions provided to our society would come to a halt. Obviously, mankind cannot wait for complete understanding before pressing ahead to launch technically complex space systems. The past few decades have seen increasingly widespread use of risk assessment and management techniques in making complex technical and strategic decisions. As safety technologists working on the risk management front, we must fulfill our vital responsibility to human society. There is no doubt that if we continue to advocate the use of science-based methodology and to gain insights through continued research and development on the theoretical front, mankind will ultimately succeed in living beyond Earth and ferrying into the cosmos with safety!

References

  • Perrow, C. Normal Accidents: Living with High-Risk Technologies. New York: Basic Books, 1984.
  • La Porte, Todd R. and Consolini, Paula. Working in Practice But Not in Theory: Theoretical Challenges of High Reliability Organizations. Journal of Public Administration Research and Theory, 1, 1991.
  • Duffey, R. B. and Saull, J. W., Manage the Risk: The human element, Butterworth-Heinemann, New York, 2002.
  • Apostolakis, G. E., Bickel, J. H., Kaplan, S., Editorial: Probabilistic Risk Assessment in the Nuclear Power Utility Industry, Reliability Engineering and System Safety, Vol. 24, No. 2, 1989.
  • W. E. Vesely, Editor, "Special Issues on Developments in Risk-Informed Decision Making for Nuclear Power Plants", Reliability Engineering and System Safety, Vol.63, No.3, 1999.
  • Frank, M. V., Probabilistic Risk Assessment in Aerospace: Evolution from the Nuclear Industry. Proc. PSAM5, Osaka Japan, Nov.2000.
  • Kaplan, S. & Garrick, B. J. On the Quantitative Definition of Risk. Risk Analysis Vol.1, No.1, 1981.
  • Azarm, M. A., Hsu, F., Role of Risk Assessment and Reliability Assurance Program for Space Vehicles, Brookhaven National Laboratory Technical Report for NASA Code Q, 1992.
  • Vesely, W. E., Hsu. F. et al., Performance of A Probabilistic Risk Assessment for the Space Shuttle, NASA Johnson Space Center Report, Jan., 2001
  • Hsu, F., Railsback, J., The Space Shuttle Probabilistic Risk Assessment Framework — A Structured Multi-layer Multiphase Modeling Approach for Large and Complex systems, Proc., PSAM7, Vol.3, Berlin Germany, 2004.
  • CAIB, The Columbia Accident Investigation Board, report Vol-I, August, 2003.
  • Fragola, J. R., Space Shuttle Program Risk Management, Proc. Reliability and Maintainability Symposium (RAMS) '96, Las Vegas, NV, January 1996.
  • Hsu, F., An Integrated Risk Management Framework — The Triple-Triplet Concept for Risk-informed SMA Management. NASA RMC, October, 2004.
  • Petroski, H. Design Paradigms, Case Histories of Error and Judgment in Engineering. Cambridge University Press, 1994.
  • Mosleh, A. and Rasmuson, D. M., Guidelines on Modeling Common-Cause Failures in Probabilistic Risk Assessment, NUREG/CR-5485, November 1998.
  • Hatfield, A. J. and Hipel, K. W., Risk and Systems Theory. International Journal, Risk Analysis, Vol. 22, No. 6, 2002.
  • Rasmussen, J. and Svedung, I. Proactive Risk Management in a Dynamic Society. Report of Swedish Rescue Services Agency, 2000.

About the Authors

Feng Hsu

Romney Duffey

Extracted from the book Beyond Earth - The Future of Humans in Space edited by Bob Krone ©2006 Apogee Books ISBN 978-1-894959-41-4