The Future of Managing Mission-Critical Systems

Adhering to functional safety standards is demanding in terms of both time and money.

April 1, 2022

There’s no doubt that the complexity of aerospace systems design is constantly increasing, driven by new demands on architecture and next-generation technologies. As a result, the cost and time associated with creating, certifying and deploying mission-critical electronics will rise sharply unless these systems are managed in a new way.

The established approach of adopting an RTOS that can be certified in accordance with functional safety standards presents several challenges. Developing and maintaining an RTOS to meet the demanding objectives of (for example) FAA DO-178B/C DAL A is a considerable undertaking that represents a significant investment for any system developer. Because such RTOSes are so complex, they also have substantial (and suboptimal) footprints. Functional safety standards should ideally minimize development overhead by permitting the separation of a software system into software items, with the aim of placing as little of the system as possible into the more critical classes.

For many applications, there are now better solutions from a safety, security and operational efficiency perspective. The most successful approach a company can take is to use an alternative mechanism to fulfil the kind of sophisticated multi-tasking requirements traditionally addressed by an RTOS. The specific path forward being adopted by many in the industry is to harness mixed criticality systems to achieve adequate separation between software items, which is vital to meeting functional safety standards.

Reconsidering Your Existing Approach

If you’re reconsidering your approach to managing mission-critical systems, you might be wondering what your timeline should look like and which external factors affect it. Two points come to mind.

First, as of now the FAA IMA (DO-297) certification process mandates the use of partitions. Aircraft manufacturers can opt out of IMA. However, IMA in general offers the long-term benefit of economies of scale when certified modular partitions are reused.

Second, from a safety perspective, the VM platform model wins out. With an increase in multicore processor adoption, many system architects have been consolidating the operation of multiple applications onto single SoCs. The danger is that because the applications on these SoCs share resources (memory, I/O peripherals, etc.), they can interfere with one another. While the impact of this for a consumer product is minimal, it is significant for mission-critical systems.

When these systems demand that certain real-time events be addressed in a deterministic, predictable and immediate way, interference, whether created accidentally (such as through poorly written applications) or maliciously (e.g. cyberattacks), becomes a significant issue. The strategy to mitigate these issues is multifold: provably isolating applications from each other; architecting systems such that the fundamental elements that keep humans safe and secure cannot be altered if an application fails or misbehaves; and recognizing early, and recovering from, a compromised system.

A Separation Kernel Hypervisor Defined

Figure 1. RTOS versus Separation Kernel Approach

A separation kernel is a special type of bare metal hypervisor that only does separation. More specifically, it is a tiny piece of carefully crafted code (as small as 15 KB) that utilizes modern hardware virtualization features to:

Define fixed virtual machines (VMs).
Control information flows.
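
As a rough illustration of these two responsibilities, the following minimal Python sketch models a fixed set of VMs and an explicit list of permitted one-way information flows. The class names, VM names and the may_send check are hypothetical and chosen only for this example; they do not represent any real separation kernel configuration format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    """A permitted one-way information flow between two VMs."""
    src: str
    dst: str


class SeparationPolicy:
    """Hypothetical model: VMs and flows are fixed when the policy is built."""

    def __init__(self, vms, flows):
        self._vms = frozenset(vms)
        for flow in flows:
            # Reject any flow that names a VM outside the fixed set.
            if flow.src not in self._vms or flow.dst not in self._vms:
                raise ValueError(f"flow references unknown VM: {flow}")
        self._flows = frozenset(flows)

    def may_send(self, src, dst):
        # Anything not explicitly allowed is denied.
        return Flow(src, dst) in self._flows


# Illustrative VM names only.
policy = SeparationPolicy(
    vms=["flight_control", "network_gateway", "maintenance"],
    flows=[Flow("flight_control", "network_gateway")],
)
```

In this model the flow is directional: flight_control may send to network_gateway, but nothing may flow back, and the maintenance VM is isolated from both.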

Separation kernels contain no device drivers, no user model, no shell access, and no dynamic memory; these ancillary tasks are all pushed up into guest software running in the VMs. This simple, elegant architecture results in a minimal implementation that—while less convenient for desktop use—is an excellent fit for embedded real-time and safety-critical systems.

The separation kernel concept was first described by John Rushby in his 1981 paper Design and Verification of Secure Systems. Rushby writes:

“...the task of a separation kernel is to create an environment which is indistinguishable from that provided by a physically distributed system: it must appear as if each regime is a separate, isolated machine and that information can only flow from one machine to another along known external communication lines.”

Rushby’s idea is that separation is too important to be managed by the OS. The OS is large, complex, and responsible for many things, and thus extremely difficult to make “watertight” from a security perspective. He realized that the best way to build a secure computer system would be to factor out the management of separation from the OS into a new kind of kernel focused exclusively on separation. He called this new kernel a separation kernel. The separation kernel should be small and simple enough that it can be intimately examined and fully understood to the point of being formally proved to be correct.

Separation kernel use-cases were initially secure workstations responsible for high security government and DoD applications requiring separation of Top Secret, Secret, and Confidential information classifications. Embedded military network communications systems such as secure radio gateways followed, and more recently separation kernels have found application as a superior hypervisor in embedded systems and in safety-critical avionics systems seeking stronger separation to manage multicore interference. Despite this, separation kernels have remained a niche concept predominantly acknowledged in the security industry.

Separation Kernel Hypervisors Enable Robust Space Partitioning

Space partitioning means that a partition must be prevented from accessing the software or data of other partitions. In an OS this is achieved with a process. A process is an MMU-enforced region of memory that contains tasks. Triggered by the fork() call, processes are created on-the-fly by an OS by configuring the MMU to create a new protected memory region and populating it with a task. The same concept of using the MMU to enforce protected memory regions is used by a separation kernel hypervisor, but it differs from an OS in three ways. The hypervisor:

Configures the MMU once – at boot time. Partitions are fixed until power-down.
Does no task scheduling. Scheduling is left to software running in the partition.
Uses a special MMU - the second level address translation (SLAT) designed specifically to support virtualization.
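
The first of these differences, a partition layout that is validated once and then never changes, can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the partition names and addresses are invented, and a real hypervisor programs hardware page tables rather than a Python tuple.

```python
from typing import NamedTuple


class Partition(NamedTuple):
    name: str
    base: int  # first physical address of the region
    size: int  # length of the region in bytes


def build_partition_table(parts):
    """Validate the layout once (as at boot) and return an immutable table."""
    ordered = sorted(parts, key=lambda p: p.base)
    for lower, upper in zip(ordered, ordered[1:]):
        if lower.base + lower.size > upper.base:
            raise ValueError(f"{lower.name} overlaps {upper.name}")
    return tuple(ordered)


def owner_of(table, addr):
    """Return the name of the partition that owns addr, or None."""
    for part in table:
        if part.base <= addr < part.base + part.size:
            return part.name
    return None


# Hypothetical layout: two adjacent, non-overlapping regions.
TABLE = build_partition_table([
    Partition("rtos_guest", 0x8000_0000, 0x0100_0000),
    Partition("bare_metal_ctrl", 0x8100_0000, 0x0010_0000),
])
```

Because the table is built once and stored as a tuple, there is no code path in this sketch that adds or resizes a partition afterwards, which mirrors the "fixed until power-down" property.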

Intel’s SLAT is called EPT, Extended Page Tables; Arm’s implementation is the Stage-2 MMU. SLAT provides nested MMU paging that allows the hypervisor, running in privileged mode, to map physical memory to create partitions. The trick is that guest OSs running in those partitions have their own MMU and use it as normal from their own kernel space to create their own (MMU protected as normal) user processes. A guest OS running in this type of virtualized environment is oblivious to the presence of the hypervisor.
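
The nested translation can be pictured as two table lookups in sequence. The sketch below is a deliberately simplified, single-level model in Python (real EPT and Stage-2 tables are multi-level and live in hardware); the page numbers and function names are assumptions made for illustration.

```python
PAGE_SIZE = 4096


def translate(page_table, addr, stage):
    """Map an address through one single-level page table (simplified)."""
    page, offset = divmod(addr, PAGE_SIZE)
    if page not in page_table:
        raise PermissionError(f"{stage} fault at {addr:#x}")
    return page_table[page] * PAGE_SIZE + offset


def guest_access(guest_table, slat_table, guest_virtual):
    # Stage 1: owned by the guest OS, exactly as on bare hardware.
    guest_physical = translate(guest_table, guest_virtual, "stage 1")
    # Stage 2: owned by the hypervisor; the guest cannot see or change it.
    return translate(slat_table, guest_physical, "stage 2")


# Hypothetical mappings: page number -> page number.
guest_table = {0: 5, 1: 6}     # guest virtual  -> guest physical
slat_table = {5: 100, 6: 101}  # guest physical -> host physical
```

If a guest edits its own table to point at a guest-physical page that the hypervisor never mapped, the second lookup fails, so the guest cannot reach memory outside its partition no matter how it programs its own MMU.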

Multicore Interference Isolation

With robust space partitioning solved by the hypervisor, the problem of time partitioning is left exposed, open for study. Achieving space partitioning without an RTOS improves the accuracy of our interference timing measurements. It’s like polishing the lens of a microscope. Multicore timing effects may be fleetingly brief and unpredictable.

Eliminating the RTOS helps in two ways. First, it reduces the memory pressure in the system. When executed, every piece of RTOS code and data needs to be loaded into the CPU over the memory hierarchy, consuming memory bandwidth and polluting the caches. Interference caused by contention for these very devices is precisely what you are trying to measure. Second, the RTOS’s memory pressure itself is unpredictable. The RTOS scheduler pre-empts and switches between tasks, generating a complex instruction stream that creates an unpredictable load on the system. In both ways the RTOS creates increased background noise that obscures the interference you are trying to measure.

A new trend emerged at the HiPEAC (High Performance Embedded Architecture and Compilation) research conference in early 2021, where three research projects presented their multicore safety-critical software platforms:

MASTECS (Multicore Analysis Service and Tools for Embedded Critical Systems)
De-RISC
SELENE

All three projects are attacking the same problem, the multicore problem, and all of them are using hypervisors to do it. Barcelona Supercomputing Center (BSC), which is involved in all three projects, is leading the MASTECS project and is contributing its software microbenchmarks. Rapita Systems’ software timing analysis expertise and RVS tool are used to characterize multicore interference on an automotive use case from Marelli Europe and an avionics use case from Raytheon Technologies. BSC’s microbenchmarks are tiny loops of code that specifically target parts of the multicore processor to create contention in a controlled and predictable way. They are perfect for studying the effectiveness of partitioning solutions and mitigations. With the multicore system cleanly and minimally configured, the BSC suite of microbenchmarks is used to stress the use-case application. Various combinations of benchmarks are run bare-metal on processor cores inside hypervisor partitions. In this setup, robust space partitioning is provided by the separation kernel hypervisor and time partitioning can be precisely tested in isolation.
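
To make the measurement idea concrete, here is a small Python sketch of how results from such a campaign might be summarized: the use-case application is timed alone, then again while each contender microbenchmark runs on another core, and the slowdown is reported per contender. The benchmark names and all timing values are invented for illustration; they are not results from MASTECS or any other project.

```python
def slowdown_report(baseline_us, contended_us):
    """Return per-contender slowdown factors and the worst contender."""
    report = {name: t / baseline_us for name, t in contended_us.items()}
    worst = max(report, key=report.get)
    return report, worst


# Made-up numbers: execution time of the application in microseconds.
baseline_us = 100.0  # application running alone
contended_us = {
    "l2_cache_stress": 130.0,   # hypothetical contender names
    "dram_read_stress": 175.0,
    "bus_write_stress": 120.0,
}

report, worst = slowdown_report(baseline_us, contended_us)
```

With these invented inputs the largest slowdown comes from the DRAM contender, which is the kind of finding that would direct attention to a specific shared resource.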

Leveraging Modern Hardware

Modern multicore processors contain a rich set of resources. As well as multiple cores, they include peripherals, memory, and advanced virtualization features that enable them to be treated like a LEGO set of components for building configurations of virtual machines (VMs). Although heavily used in cloud data centers, these modern hardware technologies are often poorly supported by RTOSes and embedded hypervisors.

Separation kernels can be used to partition processor hardware resources into high assurance VMs that are both tamper-proof and non-bypassable, and to set up strictly controlled information flows between VMs and peripherals so that VMs are isolated except where explicitly allowed. Effectively, a separation kernel is a “processor partitioning system” that allows builders of embedded systems to unlock the benefits of modern full-featured multi-core processors.

The advantage of a separation kernel hypervisor lies in its simplicity: it is derived from a static partitioning system that uses a configured hardware platform to create independent, isolated hardware instances (or subsystems) for VMs. Systems are partitioned so that the amount of code that needs to be certified is minimized, because that code is isolated from other applications.

The VM platform model is regarded as the superior architecture for safety due to this simplicity. It makes development, timing adjustments and analysis a straightforward exercise with minimal surprises and fewer engineering challenges. Also of note, major improvements to hardware virtualization in both the processors and peripherals have significantly reduced the negative attributes of the VM model, reinforcing the complementary (yin and yang) relationship between the RTOS and the separation kernel hypervisor.

Proven to operate in the intended deterministic, real-time way, a separation kernel hypervisor remains the only way to keep operational costs and hours down while ensuring security and safety are airtight.

Another consideration for many is securing the ability to deploy safety-critical control algorithms as independent bare metal applications, that is, applications that use no operating system at all. This enables developers and evaluators to measure the interference between software components and helps ensure that critical applications meet their timing deadlines. With a separation kernel hypervisor, each VM is able to run just enough RTOS to get its job done. At one extreme, a VM might host an entire open source RTOS such as FreeRTOS or Micrium μC/OS. Another, separate VM might host a bare metal application. These VMs can be combined in any mix into a single system.

Therefore, it can be concluded that factoring separation out of the OS makes sense now and will continue to make sense into the future, regardless of industry certification rules.

Enter the Z-Application

A “Z-app” (short for Z-application) is a collection of separation kernel virtual machines. The Z-app concept addresses the needs of application developers looking to achieve sophisticated, hard real-time behavior complete with function protection and domain separation, while avoiding the overheads inherent in RTOS use.

Z-app was originally conceived by Lynx Software Technologies to address an issue in the automotive sector. The classic AUTOSAR stack implementation used in that industry runs all functions in a flat address space and uses a microcontroller RTOS (typified by the ETAS RTA-OS) to schedule them. Such an approach offers no domain separation or function protection.

This problem was resolved by introducing a flexible scheduler (“Z-scheduler”) hosted by a dedicated VM. Replacing the scheduling functionality found in RTA-OS, Z-scheduler is a function caller that jumps into a separate memory dimension using hypervisor context switch “hypercalls.” AUTOSAR functions are implemented as “Z-functions” in separate VMs, hence providing the required scheduling capability coupled with domain and function protection.

Z-Application Architecture

Figure 3. Z-app architecture showing how Z-function and Z-scheduler VMs are hosted, and how shared memory is leveraged.

Today’s Z-application is a collection of separation kernel hypervisor virtual machines that belong to a common execution group. Each Z-app instance establishes a conventional framework for bare-metal applications, modelling a program stack such that a program runtime creates a standard memory layout to organize the execution flow of functions within a main program (Figure 3).

In practice, at runtime, Z-app mimics a conventional computer program such that the Z-scheduler takes the role of the “main” entry point. Each Z-function is allocated its own VM (Lynx terminology is a “room”) such that it is the equivalent of a method or function in a conventional program, but with the benefit of protection via separation. Global/heap memory and stack memory are allocated and utilized exactly as they would be in a conventional program.
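
The program-like structure described above can be modelled loosely in Python. In this sketch each Room object stands in for a VM with private memory, a dictionary stands in for shared memory, and an ordinary method call stands in for the hypercall-based context switch. All names here are hypothetical and the code is not based on the Lynx API.

```python
class Room:
    """Stands in for one VM ("room") hosting a single Z-function."""

    def __init__(self, name, z_function):
        self.name = name
        self._z_function = z_function
        self._private = {}  # memory only this room's function receives

    def enter(self, shared):
        # Models the context switch into the room and the return from it.
        self._z_function(self._private, shared)


def read_sensor(private, shared):
    private["reads"] = private.get("reads", 0) + 1
    shared["sample"] = 10 * private["reads"]  # fake sensor value


def compute_command(private, shared):
    shared["command"] = shared["sample"] * 2


def z_scheduler(rooms, shared, call_order, cycles):
    """Plays the role of main(): calls each Z-function in turn."""
    for _ in range(cycles):
        for name in call_order:
            rooms[name].enter(shared)


rooms = {r.name: r for r in (Room("read_sensor", read_sensor),
                             Room("compute_command", compute_command))}
shared = {}
z_scheduler(rooms, shared, ["read_sensor", "compute_command"], cycles=3)
```

The two functions communicate only through the shared dictionary; the read counter kept by read_sensor is never handed to compute_command, which is the property that real VM separation would enforce in hardware.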

Scheduling

Unlike an RTOS, which typically manages thread priority scheduling in an inaccessible “black box” scheduler, scheduling in a Z-app becomes the responsibility of the application running in guest space, in the form of its Z-scheduler. The Z-app’s characteristics not only ensure that address space separation is maintained, but also make the implementation of custom schedules much easier.

Figure 4. Z-scheduler implementation of a periodic scheduler with HW timer-enforced budgets.

The Z-scheduler calls the Z-functions (zFns) according to a scheduling algorithm that is customizable to suit each application. For example, Figure 4 illustrates a Z-scheduler implementation of a periodic scheduler with HW timer-enforced budgets.
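
A periodic schedule with enforced budgets can be approximated with a short simulation. In the Python sketch below the hardware timer is modelled by simply capping each Z-function's run time at its budget; the field names, the simulated demand values and the log format are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ZFn:
    name: str
    budget_us: int  # time slice the timer would enforce
    demand_us: int  # simulated time the function wants in this period


def run_period(zfns):
    """Run one period; return (name, used, slack, expired) per Z-function."""
    log = []
    for z in zfns:
        used = min(z.demand_us, z.budget_us)  # timer cuts off any overrun
        expired = z.demand_us > z.budget_us   # budget expiration event
        log.append((z.name, used, z.budget_us - used, expired))
    return log


# Hypothetical frame with made-up budgets and demands.
frame = [ZFn("sense", 200, 150), ZFn("control", 500, 620), ZFn("actuate", 100, 100)]
log = run_period(frame)
```

In this example "sense" returns early with 50 microseconds of slack, while "control" is stopped when its budget expires, so "actuate" still starts on time.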

Time Donation

As shown in Figure 5, time donation functionality is also available, allowing a Z-function to donate the remainder of a time-slice to another Z-function. The implementation mechanism is wrapped to look like standard C language function calls.
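
One simple way to picture time donation is as a transfer of unused budget from one Z-function to another within the same period. The Python sketch below models only that bookkeeping; the function name and data layout are hypothetical, and the real mechanism operates through the hypervisor rather than a dictionary.

```python
def donate(budgets_us, used_us, donor, recipient):
    """Return new budgets after donor gives its unused time to recipient."""
    slack = budgets_us[donor] - used_us
    if slack < 0:
        raise ValueError("donor has already exceeded its budget")
    updated = dict(budgets_us)    # leave the caller's table untouched
    updated[donor] = used_us      # donor keeps only what it consumed
    updated[recipient] += slack   # recipient gains the remainder
    return updated


# Hypothetical budgets in microseconds.
budgets = {"sense": 200, "control": 500}
after = donate(budgets, used_us=150, donor="sense", recipient="control")
```

The total time in the period is unchanged by the donation, so the overall schedule length stays the same while the recipient gains headroom.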

Measurement libraries and constructs are available to provide metrics for analysis, including time in a Z-function, time remaining after return (slack), time expiration exceptions, and various PMU values. The architecture and its features lend themselves to hierarchical scheduling, allowing VMs to run in accordance with independent scheduling schemes and priority groups.

Certification Considerations

Figure 6. Future Airborne Capability Environment (FACE) Approach

Figure 7. Design Approach to Running Safety Critical Applications.

The overhead of certification is a primary concern in many of the sectors likely to have a use for this technology. The integrity afforded by the function protection and domain separation characteristics has been discussed, but perhaps less obviously, the modularity of the discrete Z-functions aids both reuse and behavior analysis. Figure 6 shows a design where Future Airborne Capability Environment (FACE™) applications rely on deep abstraction layers implemented across multiple CPU cores. In this design, the potential amount of interference that the FACE guest can generate in the hardware alone will create a challenging analysis exercise to ensure critical applications will meet their timing deadlines.

This illustration also shows the impracticality of hosting a critical application on deep abstraction layers: integrity and timing analysis must then account for a complex runtime environment that inherits interference from both the hardware and the layers of abstraction concurrently accessed by co-hosted applications.

The system illustrated in Figure 7 shows a design that minimizes abstraction layers and restricts dependencies to basic software components for running safety critical applications. The FACE portion of the design is limited in hardware access and is only used as a simple transport of packets to maintain network interoperability, while the bare-metal safety critical applications run autonomously.

This design gives architects and evaluators precise insight both into where critical applications are running, and into their dependencies. Evaluators can measure the worst possible case of interference generated by a given virtual machine. They are also given assurances that there are no internal software platform dependencies on complex abstractions such as “syscalls” into SMP kernels, internal thread queues, global data locks, and coherency protocols that could make the interference analysis extremely difficult.

Summary

Adhering to functional safety standards is demanding in terms of both time and money. Across the safety critical sectors, there is an increasing desire to adopt multicore processors to address the security concerns heightened by a seemingly unassuageable thirst for connectivity. The conventional approach has been to adopt an RTOS, but the overhead and inherent lack of flexibility associated with that approach, coupled with a drive to minimize certification costs through segregation and other means, make it suboptimal for many applications.

The emergence of Z-applications fills the void where simple, bare-metal, VM-hosted applications are not sufficiently sophisticated and where an RTOS is simply too unwieldy to provide an ideal solution. Such an approach brings benefits across the critical sectors. By implementing critical software items as Z-apps, their footprint is minimized and their separation from less critical applications is assured.

This article was written by Will Keegan, CTO at Lynx Software Technologies (San Jose, CA).