Multicore Tools for Embedded Systems: Staying Ahead of the Game

Building a multicore system means dealing with non-determinism. Interactions between tasks running on different cores can occur in a different order and at a different rate from one run to the next. This makes bugs harder to reproduce, find, and fix. It also lowers the probability that validation and QA have caught every problem.

The following is an example of how moving to a multicore processor can introduce bugs that are difficult to track down because of non-determinism.

Example of a Multicore Bug

Figure 1. Timeline of a multicore bug.

A network router has been set up to handle ARP responses in an interrupt handler. The handler checks whether a global lock is held. If it is not, the handler updates the ARP cache immediately and returns to the idle loop. If the lock is held, the handler adds the ARP response to a list and the update happens later. This design was initially implemented for a single-core system, where no other code can execute while the interrupt is being serviced, so the design worked every time. However, when the code is ported to multicore, the second core may grab the lock shortly after the ISR on the first core has checked whether the lock is taken. This leads to crashes, as shown in Figure 1.

Debugging this crash is difficult because the first core triggered the problem, yet when you inspect it in the debugger it appears to be unrelated: it is in an idle loop, doing nothing wrong.

However, this is the easy-to-debug case. Sometimes, instead of crashing on a NULL pointer dereference, core 2 could get back the wrong ARP table entry. The packet would then be sent to the wrong client, which would produce bugs that are even harder to track down.
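The check-then-act race described above can be sketched in C11. This is a minimal illustration, not the router's actual code: `arp_lock`, `arp_try_lock`, `arp_unlock`, and `arp_isr_on_response` are hypothetical names, and the two counters stand in for the real cache update and the deferred list. One possible fix, shown here, is to make the check and the acquisition a single atomic step so that no other core can take the lock in between.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical names for illustration; not taken from the router's code. */
static atomic_flag arp_lock = ATOMIC_FLAG_INIT;
static atomic_int cache_updates;     /* responses applied to the cache at once */
static atomic_int deferred_updates;  /* responses queued for a later update */

/* Test AND take the lock in one atomic step. A separate "is it held?"
 * check followed by an update leaves a window for another core. */
static bool arp_try_lock(void)
{
    /* test_and_set returns the previous value: false means the lock
     * was free and now belongs to the caller. */
    return !atomic_flag_test_and_set_explicit(&arp_lock, memory_order_acquire);
}

static void arp_unlock(void)
{
    atomic_flag_clear_explicit(&arp_lock, memory_order_release);
}

/* Body of the interrupt handler for an incoming ARP response. */
static void arp_isr_on_response(void)
{
    if (arp_try_lock()) {
        atomic_fetch_add(&cache_updates, 1);    /* stand-in: update ARP cache */
        arp_unlock();
    } else {
        atomic_fetch_add(&deferred_updates, 1); /* stand-in: append to list */
    }
}
```

Because the handler now holds the lock while it updates the cache, code on the other core that reads the cache has to acquire the same lock first.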

How to Move to Multicore

Almost everyone who switches to multicore finds that an application they thought handled concurrency correctly actually has many bugs. Fortunately, there are tools available to make the transition from single core to multicore less painful.

We are going to focus on tools and features that are provided by the operating system. Most embedded operating systems, such as Green Hills Software’s INTEGRITY® RTOS, provide these features. The following tips on how to use these tools are focused on increasing the determinism of a system. This will make it possible to release the product faster and with fewer painfully long debug and test cycles.

Increase Determinism with Address Spaces

The simplest OS feature is splitting threads into their own address spaces. Yes, this is technology from the 1970s, but much of the embedded world still runs everything in the kernel or in a small handful of giant address spaces with no MMU protection. For example, a system might run over 100 threads, with several million lines of web server, secure shell, configuration, and other application code, all in kernel space. The entire system is at the mercy of the worst programmer, on his worst day, making risky changes just before he leaves for a three-week vacation.

Pulling threads into their own address space makes it easier to tell where the critical sections are that require locking. When threads share the same address space, every line of code is a potential critical section that needs to be checked. When running in different address spaces, the communication and interaction between them is, by necessity, far more constrained, which means it is easier to analyze the interactions and ensure that they are correct. It is also easier to log those interactions.
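As a rough illustration of how separate address spaces constrain interaction, the sketch below uses POSIX processes and a pipe as a stand-in for whatever address-space and messaging primitives your RTOS provides; it does not use any INTEGRITY API. The names `run_isolated_worker`, `worker_compute`, and `parent_state` are hypothetical. The worker cannot touch the parent's memory, so the pipe is the only interaction that needs to be analyzed or logged.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical state owned by the parent; the worker only gets a copy. */
static int parent_state = 1;

/* Hypothetical worker computation. */
static uint32_t worker_compute(uint32_t x) { return x + 1u; }

/* Runs the worker in its own address space. Returns 0 on success and
 * stores the worker's reply in *out, or returns -1 on any failure. */
static int run_isolated_worker(uint32_t input, uint32_t *out)
{
    int fds[2];
    assert(out != NULL);
    if (pipe(fds) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {              /* child: separate address space */
        close(fds[0]);
        parent_state = 99;       /* changes only the child's copy */
        uint32_t reply = worker_compute(input);
        ssize_t n = write(fds[1], &reply, sizeof reply);
        _exit(n == (ssize_t)sizeof reply ? 0 : 1);
    }

    close(fds[1]);               /* parent: the pipe is the only channel */
    uint32_t reply = 0;
    ssize_t n = read(fds[0], &reply, sizeof reply);
    close(fds[0]);

    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (n != (ssize_t)sizeof reply || !WIFEXITED(status) ||
        WEXITSTATUS(status) != 0)
        return -1;
    *out = reply;
    return 0;
}
```

The stray write to `parent_state` in the worker is the kind of bug that would silently corrupt shared data in a single address space; here it has no effect on the parent.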

Always Log So You Can Always Debug

Robust, low overhead logging provided in the OS is needed to help you manage the inevitable non-determinism that will creep into your system. No matter what you do, if you interact with the outside world in any way, your program will be non-deterministic to some degree. It is essential to have logging in the OS, so you can see what is happening at the lowest levels, interleaved with additional application specific logging you add to the log stream.

To make sense of the log data, you also need a graphical tool that shows it in a linear time display. The tool should let you easily zoom into areas of interest, correlate specific events with the threads that generated them, and tell you where they came from in the code. You need this when developing a multicore system because there will inevitably be problems that show up very infrequently, perhaps only once in weeks of testing. With logging, there is a good chance that you can figure out what happened and fix it. The simplest way to do this is to compare a log of the failure with a log of a successful run. Differences between the logs are good indications of where to narrow the search.
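The comparison of a failing log with a successful one can be sketched as a search for the first point where the two event sequences diverge. The record layout (`struct log_event`) and the function name `first_divergence` are assumptions made for this example; a real OS log carries more fields. Timestamps are deliberately left out of the comparison because they differ from run to run even when the ordering is identical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical event record: which thread logged which event. */
struct log_event {
    uint16_t thread_id;
    uint16_t event_id;
};

/* Returns the index of the first event at which the two logs disagree.
 * If one log is a prefix of the other (or they are identical), returns
 * the length of the shorter log. */
static size_t first_divergence(const struct log_event *good, size_t good_len,
                               const struct log_event *bad, size_t bad_len)
{
    size_t n = good_len < bad_len ? good_len : bad_len;
    assert(n == 0 || (good != NULL && bad != NULL));
    for (size_t i = 0; i < n; i++) {
        if (good[i].thread_id != bad[i].thread_id ||
            good[i].event_id != bad[i].event_id)
            return i;
    }
    return n;
}
```

The returned index is only a starting point for the search: the events just before it show what both runs had in common, and the events just after it show where the ordering changed.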

The logging mechanism needs to have low overhead so that it can always be on. This is for three reasons:

  1. Some bugs are so timing dependent that they disappear when logging is enabled. If logging is always on, however, even in the final production system, then you will never have this problem.
  2. If a hard-to-reproduce bug does show up, you will have a log of it and you can investigate the problem right away. If, instead, you only turn the log on when you are looking for a problem, you are bound to waste time trying to get a log of the failure after the fact.
  3. If a problem shows up in the field, there will be a log to inspect. Asking customers to turn on logging and spend their time reproducing problems leads to angry customers, especially when they cannot get the problem to reproduce with logging enabled.

If the log is always on, it will eventually overflow the log buffer. To address this problem, the log should be stored in a circular buffer, so that the newest entries overwrite the oldest. When a problem is detected, the code should save the log buffer to more permanent storage for later inspection.
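A circular log buffer of this kind can be sketched in a few lines of C. The names (`ring_log`, `log_write`, `log_snapshot`) and the tiny capacity are assumptions chosen for illustration. The sketch also assumes a single writer; on a multicore system you would need one buffer per core or an atomic index, which is omitted here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LOG_CAPACITY 4u  /* tiny on purpose; real buffers are far larger */

struct log_entry {
    uint32_t timestamp;
    uint32_t event_id;
};

struct ring_log {
    struct log_entry entries[LOG_CAPACITY];
    uint64_t written;    /* total entries ever written */
};

/* Appends an entry, overwriting the oldest one once the buffer is full. */
static void log_write(struct ring_log *ring, uint32_t timestamp,
                      uint32_t event_id)
{
    assert(ring != NULL);
    struct log_entry *e = &ring->entries[ring->written % LOG_CAPACITY];
    e->timestamp = timestamp;
    e->event_id = event_id;
    ring->written++;
}

/* Copies the surviving entries, oldest first, into out (which must hold
 * LOG_CAPACITY entries) and returns how many were copied. Call this when
 * a problem is detected, then write out to permanent storage. */
static size_t log_snapshot(const struct ring_log *ring, struct log_entry *out)
{
    assert(ring != NULL && out != NULL);
    size_t count = ring->written < LOG_CAPACITY ? (size_t)ring->written
                                                : (size_t)LOG_CAPACITY;
    uint64_t first = ring->written - count;
    for (size_t i = 0; i < count; i++)
        out[i] = ring->entries[(first + i) % LOG_CAPACITY];
    return count;
}
```

The buffer size sets how much history you get before a failure, so it is worth sizing it against the event rate you expect rather than the memory you happen to have left over.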

Of course, there is a cost to always having logging enabled: it uses some CPU time and some memory. However, if you place the log points carefully, you will find that it saves considerable development and release time. Elusive bugs will be fixed instead of being discovered by customers.

Maintaining Determinism of a Single Core System

Figure 2. Viewing log data in a zoomable, linear time display.

Advanced operating systems have the ability to lock a thread to a specific core, and even to specify which core receives specific interrupts. When porting to a multicore chip, start by using core affinity to lock all application threads to a single core. The goal is to minimize changes in the environment that your program runs in, so that you can get it up and running one step at a time while keeping your system as stable as possible.
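The affinity API differs from one operating system to another. As one concrete illustration, the sketch below uses the Linux-specific `sched_getaffinity` and `sched_setaffinity` calls; it is not the INTEGRITY API, and the helper name `pin_to_first_allowed_cpu` is hypothetical. It restricts the calling thread to a single core, which is the first step described above.

```c
#define _GNU_SOURCE   /* exposes the Linux-specific affinity calls */
#include <assert.h>
#include <sched.h>

/* Pins the calling thread to the lowest-numbered CPU it is currently
 * allowed to run on. Returns that CPU number, or -1 on error. */
static int pin_to_first_allowed_cpu(void)
{
    cpu_set_t allowed;
    if (sched_getaffinity(0, sizeof allowed, &allowed) != 0)
        return -1;

    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (CPU_ISSET(cpu, &allowed)) {
            cpu_set_t one;
            CPU_ZERO(&one);
            CPU_SET(cpu, &one);
            /* A pid of 0 means "the calling thread". */
            if (sched_setaffinity(0, sizeof one, &one) != 0)
                return -1;
            return cpu;
        }
    }
    return -1;
}
```

On Linux, threads created afterwards inherit the creator's affinity mask, so calling this early in startup keeps the whole application on one core until you choose to move individual threads.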

Once it is running on just one of the cores, you can pick a specific thread that is processing intensive and assign it to one of the unused cores. This greatly reduces the amount of code that you have to inspect, test, and debug to ensure there are no unexpected timing dependencies. At the same time, it also increases determinism in the entire system. For example, a network router could dedicate one core to general purpose code. Each additional core handles a specific set of network ports. Interrupts for those ports are delivered only to that core, which helps limit interaction between threads on different cores.
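The router partitioning in this example can be expressed as a fixed mapping from port number to core. The sketch below is one way to do it, under assumed values: four cores, twelve ports, and the hypothetical name `core_for_port`. Core 0 is reserved for general purpose code and the ports are split into contiguous blocks across the remaining cores.

```c
#include <assert.h>

#define NUM_CORES 4u   /* assumption: core 0 plus three worker cores */
#define NUM_PORTS 12u  /* assumption: twelve network ports */

/* Returns the core that handles a given port, including its interrupts.
 * Core 0 is never returned; it is kept for general purpose code. */
static unsigned core_for_port(unsigned port)
{
    assert(port < NUM_PORTS);
    unsigned workers = NUM_CORES - 1u;
    /* Round up so every port lands on a valid worker core. */
    unsigned ports_per_core = (NUM_PORTS + workers - 1u) / workers;
    return 1u + port / ports_per_core;
}
```

Because the mapping is static, the set of threads that can interact on any one core is known ahead of time, which is what keeps the amount of code to inspect small.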

As an additional benefit, when each thread is in its own address space and assigned to a specific core, you may actually get better performance than with a monolithic single address space. This is because each core has its own cache, and cache contention could significantly slow your application in hard-to-understand and hard-to-predict ways. Sometimes it will run fast, sometimes it will run slow, and even minor modifications to the code could have a large impact on performance because data could fall in or out of cache lines that are in contention.
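One common source of the cache-line contention mentioned above is data for different cores landing in the same cache line. When you do have to share an array between cores, a minimal sketch of the usual countermeasure is to give each core's data its own line. The 64-byte line size, the core count, and the names `per_core_stats` and `count_packet` are assumptions; check the line size of your target CPU.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64  /* assumption: verify for the target CPU */
#define NUM_CORES 4         /* assumption */

/* Aligning the member pads the struct to a full cache line, so no two
 * cores' counters ever share a line. */
struct per_core_stats {
    alignas(CACHE_LINE_SIZE) uint64_t packets_handled;
};

static_assert(sizeof(struct per_core_stats) == CACHE_LINE_SIZE,
              "each core's stats must occupy exactly one cache line");

static struct per_core_stats stats[NUM_CORES];

/* Each core only ever writes its own slot. */
static void count_packet(unsigned core)
{
    assert(core < NUM_CORES);
    stats[core].packets_handled++;
}
```

Without the alignment, the four counters would fit in a single line and every increment on one core could invalidate that line in the other cores' caches, even though no data is logically shared.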

Once your system is working and stable, and if you have more worker threads than cores, you may ultimately decide to let the SMP OS determine natural affinity and optimal scheduling across cores. In most cases, the OS will do a better job of assigning threads to cores dynamically than a human can statically. Just remember that this reduces determinism, and it may result in a product that is a bit faster but not as reliable.

Using AMP or a Hypervisor

Figure 3. Enhanced Type-1 embedded multicore hypervisor.

Some manufacturers are moving to multicore to consolidate functions running on multiple chips onto a single chip. This can reduce manufacturing cost, hardware design complexity, and power consumption, to name a few benefits. The quickest, safest, and cheapest way to do this is to use asymmetric multiprocessing (AMP), or a hypervisor with one guest OS for each original chip. The timing will be slightly different, but this is as close as you can get to the original multi-chip software design while running on a single chip.

If your goal is to reduce cost, you should seriously consider this approach rather than attempting to get all aspects of the application running on a single SMP OS. The separation between the different components should give you the ability to log interactions and isolate problems. Sticking with the separation will also keep you focused on getting the product done rather than embarking on a costly software redesign.

Use What You Need, Not What You Have

When working with a multicore system it is important to remember that you are attempting to accomplish a specific goal, which is rarely to simply “go multicore”. Some of the techniques above may result in cores going unused, changes to the design of the system or some additional CPU time spent on logging. However, these techniques will result in a system where it is easier to find bugs, easier to fix bugs, and easier to test and validate correctness. In short, it will be easier to build a product that works, and works reliably.

This article was written by Nathan Field, Engineering Manager, Green Hills Software (Santa Barbara, CA).