Automating Radiation Effects Mitigation with FPGA Synthesis

FPGA designers of aerospace and defense applications have long wrestled with radiation effects. More recently, due to shrinking technology nodes, concerns have arisen about radiation-related upsets in other safety- or mission-critical applications such as medical devices, networking gear, and commercial avionics. Smaller geometries in FPGAs mean that an energetic particle can more easily alter the current flow and charge storage of a device’s configuration SRAM. In fact, policymakers for DO-254, the design assurance standard for airborne hardware, are now encouraging hardware designers to examine and address radiation-effects issues through mitigation techniques.

Figure 1. Synthesis generates a mitigated gate-level netlist from an RTL design.
Through the years, FPGA designers have acquired an increasing array of tools and methods to address radiation effects. Rad-tolerant silicon is one viable approach. Another is board- or device-level redundancy, where a soft error in one unit is voted out by the other two units. A third approach is hand-coding mitigation circuitry, where the FPGA developer incorporates circuitry such as triple modular redundancy (TMR) or safer finite state machines (FSMs) at the hardware description language (HDL) level. Each method has its tradeoffs in terms of cost, schedule, safety, and quality of results. For example, rad-tolerant silicon is substantially more expensive than its commercial counterpart. Device-level triplication adds various types of overhead, including cost, area, and power. And hand coding requires a high level of expertise, is time consuming, and can introduce hard-to-detect errors.

Recently introduced technology automatically incorporates advanced mitigation circuitry during device-neutral RTL synthesis, thereby providing additional protection through the implementation flow itself. This way designers and project managers can more flexibly select the most suitable (or lower cost) silicon for their application, include adequate safeguards against a variety of radiation effects, avoid the overhead of board- or device-level redundancy, and reduce the risk of human error from hand-coded mitigation. With support for SRAM, flash, and antifuse architectures, this new technology can accelerate the development schedule, expand device options, and reduce production cost. However, as with most automatic implementation approaches, designers should ask: How does this technology work? And how can I verify it?

Synthesis-Based Automatic Mitigation

The synthesis approach essentially transforms a regular RTL design into a mitigated gate-level netlist, as shown in Figure 1. A designer decides which protection scheme to use and which areas of the design to target, but the implementation itself is automated. The technology offers two primary forms of automatic mitigation:

  1. TMR, with support for multiple vendors’ FPGAs and multiple modes of protection, discussed below.
  2. Safer FSMs for protection against single bit flips of storage elements, known as single event upsets (SEUs), which can occur regardless of the FSM encoding scheme. The protection can be in the form of detection of, and recovery from, an SEU, or it can be complete correction where an SEU will not disturb the state machine’s operation.

We will discuss both of these forms, including the types of protection they provide, configurability, and verification options.

Triple Modular Redundancy (TMR)

TMR is a widely accepted method of fault-tolerant design, where a unit is triplicated and the three copies feed one or more majority voter circuits. If an upset occurs in any one copy, the other two outvote it, so the voter output remains correct and the fault is masked.
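The masking behavior of a majority voter can be illustrated with a short behavioral sketch in Python. It is a software model rather than synthesized hardware, and the helper names `majority_vote` and `flip_bit`, as well as the 8-bit register value, are hypothetical choices made for illustration.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority: each output bit takes the value held by at least two inputs."""
    return (a & b) | (a & c) | (b & c)


def flip_bit(value: int, bit: int) -> int:
    """Model a single event upset by inverting one bit of a stored value."""
    return value ^ (1 << bit)


# Three redundant copies of the same (hypothetical) 8-bit register value.
golden = 0b1011_0010
copies = [golden, golden, golden]

# An upset corrupts bit 5 of one copy; the other two copies outvote it.
copies[1] = flip_bit(copies[1], 5)
voted = majority_vote(*copies)
print(f"{voted:08b}")
```

Because the vote is taken bit by bit, upsets in different copies are still masked as long as no two copies are wrong in the same bit position.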

In logic design, an SEU can be masked if storage elements are TMR’d. While this can be done by hand at the HDL level, synthesis-based mitigation automates the process in a form of redundancy called Local TMR (LTMR).

Synthesis is where most design optimizations take place, so it’s an ideal stage at which to make intelligent decisions about what to infer and when. In the case of LTMR, as an example, synthesis can decide whether or not to infer embedded shift registers and how to treat synchronizers across different clock domains.

Another form of synthesis-based TMR is Distributed TMR (DTMR), where all storage elements, combinational elements, and voters are triplicated. Primary I/Os can optionally be triplicated as well. DTMR offers additional protection against glitches in combinational logic known as single event transients (SETs) and can mask FPGA configuration upsets that can change user logic and data routes. Some rad-tolerant devices have shown SET susceptibility at higher radiation levels. Synthesis-based mitigation can provide additional mitigation in such environments, hence complementing the protection that is built into rad-tolerant FPGAs. At lower radiation levels, this approach also helps mask configuration errors of SRAM-based FPGAs.

Again, as with LTMR, DTMR relieves the developer from making low-level implementation decisions, such as how to insert voters in feedback loops or properly infer a TMR’d DSP function.

The last form of redundancy is Global TMR (GTMR), where registers, combinational logic, voters, and global buffers such as clocks and resets are triplicated. This offers additional protection against SETs and configuration SRAM upsets not only on user logic and data routes but also on global routes that may be vulnerable in some device architectures.

Safer FSMs: “Detection & Recovery” and “Correction”

Figure 2. Global TMR — Triplication of Sequential Logic, Combinational Logic, Voters, Global Routes, and I/Os.
The second synthesis-based approach is safer FSM implementation, which is suitable primarily when control logic is critical. This approach can be less costly in terms of area and performance than TMR, though the level of protection may differ. For example, a safer FSM does not protect against SETs and does not mask FPGA configuration errors.

The fundamental principle behind a safer FSM is to prevent the state machine from getting stuck in an unknown state or getting out of sequence due to an SEU. For example, for a three-state binary encoded FSM, most synthesis tools that run in default mode will optimize away the unspecified states since they are considered unreachable. However, an SEU can flip one bit and put the FSM into an undefined, invalid state that locks up the circuit. The simple, safer FSM available in most synthesis solutions simply maps all state bit combinations to defined states. Unfortunately, this mapping does not prevent an SEU from causing an invalid state transition between valid FSM states, as shown in the simple binary FSM in Figure 3. Such out-of-sequence transitions may be unacceptable for some applications.
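The two failure modes described above can be enumerated with a small Python sketch. It assumes a hypothetical three-state machine (`IDLE`, `LOAD`, `RUN`) binary encoded in two flip-flops; the state names and encodings are illustrative and are not taken from Figure 3.

```python
from enum import IntEnum


class State(IntEnum):
    # Hypothetical three-state machine, binary encoded in two flip-flops.
    IDLE = 0b00
    LOAD = 0b01
    RUN = 0b10


STATE_BITS = 2
VALID = {int(s) for s in State}


def classify_upsets():
    """Flip each state bit of each valid state and classify the outcome."""
    results = []
    for state in State:
        for bit in range(STATE_BITS):
            corrupted = int(state) ^ (1 << bit)
            if corrupted in VALID:
                outcome = f"silently becomes {State(corrupted).name}"
            else:
                outcome = f"lands in undefined encoding {corrupted:#04b}"
            results.append((state.name, bit, outcome))
    return results


for name, bit, outcome in classify_upsets():
    print(f"{name}: flip bit {bit} -> {outcome}")
```

For this encoding, two of the six possible single-bit upsets reach the undefined encoding, which a simple safe FSM would catch. The other four land in another valid state, which is the out-of-sequence case that mapping unused encodings alone cannot detect.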

On the other hand, the new synthesis technology can implement FSMs with full SEU detection and recovery for all encoding schemes, including area-efficient binary and Gray encoding. This adds just two extra parity bits, so there is minimal impact on device area. However, in this type of safer FSM, an SEU will interrupt circuit operation, sending the FSM to the recovery state specified by the designer.
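The detect-and-recover idea can be sketched behaviorally in Python. The article does not describe the parity scheme the tool actually uses, so the sketch below substitutes a single even-parity bit as a simplified stand-in. The class `DetectingFsm`, the Gray-coded four-state sequence, and the choice of recovery state are all assumptions made for illustration.

```python
def parity(value: int) -> int:
    """Even-parity bit: 1 when the value has an odd number of set bits."""
    return bin(value).count("1") & 1


class DetectingFsm:
    """Behavioral model of an FSM whose state register is guarded by a parity bit."""

    def __init__(self, transitions: dict, reset_state: int, recovery_state: int):
        self.transitions = transitions      # state -> next state (inputs omitted for brevity)
        self.recovery_state = recovery_state
        self.state = reset_state
        self.check = parity(reset_state)    # stored alongside the state bits

    def inject_seu(self, bit: int) -> None:
        """Flip one state bit without updating the stored parity, as an upset would."""
        self.state ^= 1 << bit

    def clock(self) -> int:
        """Advance one cycle; a parity mismatch forces the recovery state."""
        if parity(self.state) != self.check:
            nxt = self.recovery_state
        else:
            nxt = self.transitions[self.state]
        self.state, self.check = nxt, parity(nxt)
        return self.state


# Hypothetical Gray-coded four-state sequence 00 -> 01 -> 11 -> 10 -> 00.
fsm = DetectingFsm({0b00: 0b01, 0b01: 0b11, 0b11: 0b10, 0b10: 0b00},
                   reset_state=0b00, recovery_state=0b00)
fsm.clock()            # now in 0b01
fsm.inject_seu(1)      # upset turns 0b01 into the valid-looking state 0b11
after_upset = fsm.clock()
print(f"{after_upset:02b}")
```

The upset here produces an encoding that is itself a valid state, yet the parity mismatch still exposes it and the machine restarts from the recovery state instead of continuing out of sequence. A single parity bit only detects an odd number of flipped bits, which is one reason a production scheme may differ from this sketch.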

Alternatively, the new synthesis technology can implement an SEU-correcting FSM, where an SEU does not interrupt the normal FSM operation. The impact on device area is greater with this scheme. However, the increase is logarithmically related to the FSM size, so the relative impact lessens for larger state machines. A simple two-state SEU-correcting FSM is shown in Figure 4.
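The logarithmic growth can be made concrete under one assumption: that the correcting FSM uses a Hamming-style single-error-correcting code, which the article does not confirm. The Python sketch below counts the check bits such a code would need for binary-encoded state registers of various sizes; `state_bits` and `hamming_check_bits` are hypothetical helper names.

```python
def state_bits(num_states: int) -> int:
    """Bits needed to binary-encode the given number of states (at least one)."""
    return max(1, (num_states - 1).bit_length())


def hamming_check_bits(data_bits: int) -> int:
    """Smallest r with 2**r >= data_bits + r + 1 (single-error-correcting Hamming bound)."""
    r = 1
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r


for n in (2, 4, 16, 64, 256):
    m = state_bits(n)
    r = hamming_check_bits(m)
    print(f"{n:>3} states: {m} state bits + {r} check bits = {m + r} flip-flops")
```

Under this assumption a two-state machine needs three flip-flops in total, while a 256-state machine needs only four check bits on top of its eight state bits, so the relative overhead shrinks as the FSM grows.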

Mitigation efforts must always be balanced against their effect on quality of results (QoR). In some cases a design-wide TMR scheme would exceed the capacity of the device. Hence designers need granular control of the design in order to target the blocks most critical to mitigate. Fortunately, the new synthesis technology provides this level of granularity down to a particular module, flop, or net. It also makes mitigation-aware optimization decisions such as whether or not to infer one operator over another, and can minimize the impact on QoR by inserting voters only where necessary.

Verifying Mitigation Schemes

One challenge with the new technology is verifying that the automatic insertion of mitigation circuitry has not altered the original design functionality. Such verification is especially important in safety-critical applications. Gate-level simulation is an option, but may take too long. To address this challenge, the new synthesis-based mitigation was designed to be compatible with a formal equivalence checking tool that automatically compares the original RTL to the post-synthesis netlist.

Figure 4. 2-state SEU Correcting FSM.
To verify that the mitigation circuitry properly protects against radiation, engineers can inject faults during gate-level simulation or inject configuration faults in hardware. When budget is available, real radiation testing is recommended.
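A fault-injection campaign can be sketched at the behavioral level in Python. A real campaign would run against the gate-level netlist in an HDL simulator; the version below only models a triplicated register with a bitwise majority voter, and the register width, golden value, and function names are hypothetical.

```python
from itertools import combinations

WIDTH = 4  # hypothetical register width


def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority voter."""
    return (a & b) | (a & c) | (b & c)


def run_campaign(golden: int, num_faults: int):
    """Inject every combination of num_faults bit flips across the three copies.

    Returns a (masked, unmasked) tuple of fault-combination counts.
    """
    sites = [(copy, bit) for copy in range(3) for bit in range(WIDTH)]
    masked = unmasked = 0
    for faults in combinations(sites, num_faults):
        copies = [golden, golden, golden]
        for copy, bit in faults:
            copies[copy] ^= 1 << bit
        if majority_vote(*copies) == golden:
            masked += 1
        else:
            unmasked += 1
    return masked, unmasked


single = run_campaign(0b1010, 1)
double = run_campaign(0b1010, 2)
print("single faults (masked, unmasked):", single)
print("double faults (masked, unmasked):", double)
```

In this model all 12 single faults are masked. Of the 66 double-fault combinations, the 12 that hit the same bit position in two different copies defeat the voter, which illustrates why a campaign should report coverage per fault class rather than a single pass or fail result.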

Conclusion

Figure 3. Simple State FSM with Invalid Transitions.
Concerns about radiation effects continue to grow, not just for aerospace and defense but also for many ground-based mission- and safety-critical applications. Newly introduced synthesis-based mitigation technology gives FPGA designers an important tool in their arsenal. The technology can complement mitigation built into rad-tolerant FPGAs by adding extra protection where needed and by balancing performance and reliability needs. Additionally, the new solution enables designers to use substantially more capable and less expensive FPGAs that were previously off limits because they would have required complicated and risky hand-coded mitigation. Lastly, synthesis-based mitigation creates a netlist that is easier to verify using formal equivalence checking methods.

This article was written by Daniel Platzker, FPGA Synthesis Product Line Director, and Ehab Mohsen, Technical Marketing Engineer, Mentor Graphics Design Creation & Synthesis Division (Fremont, CA).