Reliability, Safety, and AV Development

An overemphasis on safety without a robust and equivalent reliability process and organization will result in errors that could be catastrophic.

Lack of a clear, consistent construct in how reliability and safety interact and build upon each other is creating avoidable conflict and potential miscommunication that can put AV customers under unnecessary risk and add excessive systems cost. (Credit: DfR Solutions)

On March 18, 2018, the first pedestrian fatality due to the operation of an autonomous vehicle occurred in Tempe, Arizona. Since then, almost 10,000 articles have been published on this accident, most of them espousing an opinion on what it all means for the future of Uber, autonomous vehicles, public-road AV testing, and even society at large.

What is missing from this cauldron of debate are the lessons that designers of autonomous sensor, software, and platform technologies can extract from this tragic event. Learning from it will be pivotal to the financial success of autonomous vehicles.

A fundamental challenge in learning from the Tempe fatality and in determining the value of ISO 26262 (the functional safety standard for road vehicles) is in identifying the complementary and contradictory roles of reliability and safety. This is not a matter of semantics: Every manager realizes that process, authority, and responsibility are the core of every software and hardware design cycle. Who does what, who reports to whom, and when they do it can result in dramatically different outcomes.

Introducing redundancy in vehicle sensing, control, power, braking, etc., adds cost without necessarily making the occupants or surrounding traffic much safer. (Credit: DfR Solutions)

What is reliability, what is safety, and how should they relate to each other in a corporate environment? From the perspective of reliability engineers, safety is a subset of reliability. Why? While reliability focuses on the probability that a failure will occur, safety focuses on the probability that a failure will occur and result in a catastrophic event (loss, injury, or death).
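The distinction can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names, the constant-failure-rate assumption, and every numeric value are assumptions chosen for the example, not figures from any real program. It treats the reliability quantity as P(failure) and the safety quantity as P(failure) multiplied by P(catastrophic outcome given failure).

```python
import math


def probability_of_failure(failure_rate_per_hour, operating_hours):
    """Reliability view: chance of at least one failure over the period.

    Assumes a constant failure rate (exponential model), which is a
    simplification made for this example.
    """
    return 1.0 - math.exp(-failure_rate_per_hour * operating_hours)


def probability_of_catastrophic_failure(p_failure, p_catastrophic_given_failure):
    """Safety view: chance that a failure occurs AND has a catastrophic outcome."""
    return p_failure * p_catastrophic_given_failure


# Hypothetical inputs: one failure per million hours, 8,000 operating hours,
# and 1 in 100 failures assumed to lead to a catastrophic event.
p_fail = probability_of_failure(failure_rate_per_hour=1e-6, operating_hours=8000)
p_cat = probability_of_catastrophic_failure(p_fail, p_catastrophic_given_failure=0.01)

print(f"P(failure)              = {p_fail:.6f}")
print(f"P(catastrophic failure) = {p_cat:.8f}")
```

Because the conditional probability can never exceed 1, the catastrophic figure is always a subset of the overall failure figure, which is the basis of the reliability engineer's claim.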

Catastrophic events are just a small portion of the overall outlook being managed and tracked by the reliability team. Thus, in a reliability-centric world, safety engineers are managed by the reliability team and do not act until a thorough design-for-reliability (DfR) activity is complete.

Reliability and safety interact

As one would expect, safety engineers do not share the same vision. From their viewpoint, reliability analyses only provide probability of failure for a particular failure mechanism (reliability physics) or part (empirical approach). Reliability analyses have no context as to the consequence of failure: will it be catastrophic? Such analyses are therefore most effective when performed at the lowest level of the system. Because consequences are only clear at the system level, where the response of the system or the user to the failure can be considered, reliability engineers should report into the safety team.

In this view, the key function of reliability engineers is to calculate failure rates and identify basic failure modes. And since these failure rates are sometimes only numbers, why have a reliability engineer at all?

A third viewpoint is that reliability and safety are not as related as one would expect. A prime example of this philosophy is how the two disciplines would address fan performance. From a reliability perspective, the actions might be to ensure the fan meets failure rate goals for the expected environment, either through reliability physics analysis (RPA), derating, or accelerated life testing (ALT). From a safety perspective, the actions might be to determine if fan failure would induce a catastrophic event (how it interacts with the rest of the system) and then introduce potential mitigations, such as redundancy or prognostics using drift or change in key parameters (current draw, tachometer, noise).
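The safety-side mitigation described here, prognostics based on drift in a key parameter, can be sketched in a few lines. This is a minimal illustration rather than a production health monitor: the function name, the window sizes, the 15% threshold, and the current-draw readings are all hypothetical.

```python
from statistics import mean


def drift_alert(readings, baseline_samples=5, recent_samples=5, max_drift_fraction=0.15):
    """Return True when the recent average has drifted from the early-life
    baseline by more than the allowed fraction (threshold is an assumption)."""
    if len(readings) < baseline_samples + recent_samples:
        return False  # not enough data to compare baseline and recent windows
    baseline = mean(readings[:baseline_samples])
    recent = mean(readings[-recent_samples:])
    if baseline == 0:
        raise ValueError("baseline average must be non-zero")
    drift = abs(recent - baseline) / abs(baseline)
    return drift > max_drift_fraction


# Hypothetical fan current draw in amperes, oldest sample first.
healthy_fan = [0.50, 0.51, 0.49, 0.50, 0.50, 0.51, 0.50, 0.49, 0.50, 0.51]
wearing_fan = [0.50, 0.51, 0.49, 0.50, 0.50, 0.58, 0.60, 0.61, 0.62, 0.63]

print("healthy fan alert:", drift_alert(healthy_fan))
print("wearing fan alert:", drift_alert(wearing_fan))
```

The same comparison could be applied to tachometer or noise readings. Note that nothing in this check estimates when the fan will fail; it only flags a change that the system can respond to, which is why the two disciplines can treat the same component so differently.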

These different viewpoints highlight the uncertainty among technology companies on how to handle reliability and safety. One major consumer technology company that is transitioning to autonomous vehicles has Reliability and Safety reporting into the same Director. A second company, a leader in the autonomous field, has Safety and Reliability reporting into two different organizations, even though the leaders in both departments have roughly equivalent titles.

A third company, a mainstay in automotive electronics that is aggressively targeting autonomous control units, also has Safety and Reliability in two different organizations, but clearly has a favorite through the numerous executive titles assigned to Safety (while the highest reliability staffer is either Manager or Leader).

Without a clear and consistent construct in how reliability and safety interact and build upon each other, the automotive industry is creating avoidable conflict and potential miscommunication that will either put customers under unnecessary risk, create autonomous systems that are excessively expensive, or both. One autonomous vehicle manufacturer had such uncertain confidence in reliability, or such unlimited authority of the safety team, that it introduced redundancy throughout the vehicle (including sensing, control, power, braking, etc.). Given that the average car has, by some estimates, over $12,000 of electronics, this introduces significant costs without necessarily making the occupants, or the traffic around them, that much safer.

A perfect example of this issue is the divergence between safety and reliability in how to calculate failure rates. From the 1950s through the 1990s, most reliability practitioners in electronic hardware organizations used empirical handbooks to calculate failure rates. These handbooks were simply aggregations of field failure data, sorted by part technology (resistor, capacitor, diode, etc.). While simple in concept and execution, repeated studies demonstrated that these handbooks were wildly inaccurate when applied to actual products, with the error leaning toward the conservative, that is, over-predicting failure rate.
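For readers unfamiliar with the handbook approach, the following sketch shows its basic shape: look up a generic failure rate for each part type and sum across the bill of materials. The rates and the bill of materials are invented for illustration and are not taken from any published handbook. Real handbooks also apply adjustment factors, such as those for environment and part quality, which are omitted here.

```python
# Hypothetical generic failure rates in FIT (failures per 10^9 device-hours).
# These values are made up for the example.
GENERIC_FIT = {
    "resistor": 0.5,
    "capacitor": 2.0,
    "diode": 1.5,
}


def parts_count_fit(bill_of_materials, fit_table=GENERIC_FIT):
    """Sum quantity x generic failure rate over every part type in the design."""
    total = 0.0
    for part_type, quantity in bill_of_materials.items():
        total += quantity * fit_table[part_type]
    return total


def mtbf_hours(total_fit):
    """Convert a total FIT rate to mean time between failures in hours."""
    return 1e9 / total_fit


# Hypothetical module: part type -> quantity on the board.
bom = {"resistor": 120, "capacitor": 80, "diode": 10}

total_fit = parts_count_fit(bom)
print(f"Predicted failure rate: {total_fit:.1f} FIT")
print(f"Predicted MTBF: {mtbf_hours(total_fit):,.0f} hours")
```

The only inputs are part type and quantity. Nothing in the calculation reflects temperature, vibration, solder joint geometry, or how the part is actually stressed in the product.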

The reason was straightforward: these handbooks were not based on the actual mechanisms that cause failure. Fast forward to the 21st century, and most skilled reliability practitioners no longer rely exclusively on empirical field data to predict failure rates. Reliability physics analysis (RPA) and accelerated life testing (ALT) have replaced these outmoded approaches, and nowhere was this truer than in the automotive industry. Until ISO 26262 came along.

Avoiding the disconnect

As a functional safety standard, ISO 26262 requires the computation of failure rates and the appropriate mitigations to predict the automotive safety integrity level (ASIL). And the safety community, unlike reliability engineers, strongly encourages or even requires empirical prediction handbooks as the basis of ASIL calculations. This disconnect is driven by the lack of a universal construct between reliability and safety. Creating separate organizations reporting into separate management has led to a breakdown in communication, causing safety engineers to use outmoded approaches for failure rate calculations.

In addition, without a balance between the two groups, safety teams will tend to prefer higher failure rates, which require additional safety analyses and safety mitigations, including redundancy. Safety’s focus on simple handbook calculations will also result in overlooking critical failure modes, such that safety mitigations are no longer effective.

There is still an opportunity for improvement. Players in autonomous technology, from semiconductors to electronic modules to overall systems, must realize that an overemphasis on safety without a robust and equivalent reliability process and organization will result in errors that will be difficult to untangle.

A good first step is to make sure that reliability and safety are within the same organization, reporting to a neutral observer. Both sides should agree to implement best practices, including the use of state-of-the-art simulation and modeling and reliability physics, to lay the groundwork for appropriate and effective risk identification and mitigation.

Dr. Craig Hillman is CEO of DfR Solutions, specialists in quality, reliability, and durability (QRD) solutions for the electronics industry.