The Purpose of Mixed-Effects Models in Test and Evaluation

The simplest mixed model is the random intercept model, in which random effects represent group-level deviations from a grand mean. It can account for day-to-day deviations in system performance while still allowing the results to be generalized beyond the few days of observed testing.

Mixed effects models are the standard technique for analyzing data that exhibit some grouping structure. In defense testing, these models are useful because they account for correlations between observations, which are common in operational test data.

Variation in target location error (TLE) across six days of testing. The average TLE for each day is marked by a red x, and the global average is shown by the blue dashed line.

By design, operational test data are often noisy. Scenarios with real operators conducting realistic missions against a responsive opposing force generate data that reflect realistic combat environments and include operationally important sources of variance. The further one moves from lab experiments, the more uncontrollable circumstances and conditions will influence the numbers being reported.

One strategy for dealing with noisy data is simply to collect more of it. As the sample size n increases, uncertainty bars shrink, and the risk to the program is reduced. But this is an inefficient use of taxpayer dollars and, because test budgets are limited, often infeasible. An alternative is to do more with the available data.

Mixed effects models, or simply mixed models, are a well-studied statistical technique used regularly across diverse fields of research, including pharmacology, agriculture, image analysis, and biology. They are used when researchers suspect that the data contain systematically correlated errors. By properly accounting for these correlations, mixed effects models produce estimates with smaller uncertainties.

A canonical example from agriculture is a comparison of crop yields from different seeds planted in multiple fields. Each field is divided into some number of plots, one type of seed is planted in each plot, and at the end of the test, the yield of each plot is measured. Each field has unique characteristics, including exposure to sun, irrigation runoff, and which plants are adjacent to the field. Since these characteristics affect crop yield, the attributes of each field introduce noise into the data.

Because the goal is to determine how different types of seeds will perform in the future, the unique properties of the fields used for the test are not relevant. By using a mixed effects model, the field-level variation can be estimated separately from random plot-to-plot variation. This makes it easier to make comparisons, narrows the confidence bounds around estimates of the average yield of each seed type, and reduces the chance of confusing random variation due to a particularly good or poor field with changes in yield due to seed type.
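As a rough sketch of why separating field-level variation helps, assume each of F fields contains exactly one plot of seed A and one plot of seed B, and that all random terms are independent. This is a simplification, and the symbols below are illustrative rather than taken from the source. A random intercept model for the yield of seed s in field f can then be written as:

```latex
y_{fs} = \mu + \tau_s + u_f + \varepsilon_{fs}, \qquad
u_f \sim N(0, \sigma_u^2), \quad \varepsilon_{fs} \sim N(0, \sigma^2)

% The field effect u_f cancels in the within-field difference:
y_{fA} - y_{fB} = (\tau_A - \tau_B) + (\varepsilon_{fA} - \varepsilon_{fB})

% Variance of the estimated seed difference, averaged over F fields:
\operatorname{Var}\left(\bar{y}_{A} - \bar{y}_{B}\right) = \frac{2\sigma^2}{F}

% Ignoring fields folds \sigma_u^2 into the residual variance, so the
% reported variance of the same comparison is inflated to about:
\frac{2(\sigma_u^2 + \sigma^2)}{F}
```

Under these assumptions, the field variance drops out of the seed comparison entirely, which is the sense in which the confidence bounds narrow.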

While Defense Test and Evaluation (T&E) is different from agriculture, data attributes can be surprisingly similar. Consider an evaluation of a radar-equipped unmanned aerial vehicle (UAV) operating in a maritime environment. The radar system provides maritime surveillance and intelligence by detecting targets at sea and reporting their locations to a host platform. The goal of the test is to compare the radar system’s target location error (TLE, the straight-line distance between the true location of a target and the location provided by the radar system) against a requirement of 250 meters.

The test plan calls for collecting six days’ worth of data spread over the course of one month. The primary factor driving radar system performance is the distance between the UAV and the target. Therefore, the test requires that performance be measured along a range of distances to the target. Distance is analogous to seed type in the above example. Environmental factors (the state of the water over which the aircraft is flying, atmospheric conditions, etc.) will also affect radar performance. These will vary from day to day, meaning that TLE measurements will be correlated within each day. Day is analogous to the field factor in the agricultural example; that is, both factors partition the data into groups.
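One way to write a random intercept model for this test, assuming for illustration that TLE changes linearly with distance, is sketched below. The linear form and all symbols are assumptions made here, not the authors' stated model: i indexes the test day, j indexes measurements within a day, d is the distance to the target, u_i is the day-level offset, and the final term is measurement-to-measurement noise.

```latex
\mathrm{TLE}_{ij} = \beta_0 + \beta_1 d_{ij} + u_i + \varepsilon_{ij}, \qquad
u_i \sim N(0, \sigma_u^2), \quad \varepsilon_{ij} \sim N(0, \sigma^2)

% Two different measurements taken on the same day share u_i, so they are correlated:
\operatorname{Corr}\left(\mathrm{TLE}_{ij}, \mathrm{TLE}_{ij'}\right)
  = \frac{\sigma_u^2}{\sigma_u^2 + \sigma^2} \qquad (j \neq j')
```

Measurements from different days share no random term and are uncorrelated under this model, which is how day partitions the data into groups.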

The accompanying figure shows the day-to-day variation of TLE. Measurements taken on the same day clearly cluster together. For example, the conditions on days 1 and 2 seem to have provided the best set of environmental factors for the UAV’s radar system. Ignoring the day-to-day variability could lead to incorrect conclusions about the system’s performance.
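The effect of ignoring day-to-day variability can be illustrated with a small simulation. The sketch below is not the authors' analysis: the function names, the simulated means and standard deviations, and the choice of method-of-moments (one-way ANOVA) estimates in place of a fitted mixed model are all assumptions made for illustration, and distance is left out to keep the example short.

```python
import math
import random
import statistics


def simulate_tle(n_days=6, n_per_day=20, grand_mean=200.0,
                 day_sd=30.0, noise_sd=20.0, seed=0):
    """Simulate TLE in meters with a shared offset per day (made-up values)."""
    rng = random.Random(seed)
    data = {}
    for day in range(1, n_days + 1):
        day_offset = rng.gauss(0.0, day_sd)  # same for every measurement that day
        data[day] = [grand_mean + day_offset + rng.gauss(0.0, noise_sd)
                     for _ in range(n_per_day)]
    return data


def variance_components(data):
    """Method-of-moments variance components for equal-sized groups."""
    groups = list(data.values())
    k, n = len(groups), len(groups[0])
    if k < 2 or n < 2 or any(len(g) != n for g in groups):
        raise ValueError("need at least 2 equal-sized groups of 2+ values")
    day_means = [statistics.fmean(g) for g in groups]
    # Within-day mean square: average of the per-day sample variances.
    ms_within = statistics.fmean([statistics.variance(g) for g in groups])
    # Between-day mean square: n times the sample variance of the day means.
    ms_between = n * statistics.variance(day_means)
    # E[ms_between] = var_resid + n * var_day, so solve for var_day (floor at 0).
    var_day = max(0.0, (ms_between - ms_within) / n)
    total = var_day + ms_within
    return {
        "grand_mean": statistics.fmean(day_means),
        "var_day": var_day,
        "var_resid": ms_within,
        "icc": var_day / total if total > 0 else 0.0,
    }


def grand_mean_standard_errors(data):
    """Return (naive SE treating all points as independent, SE from day means)."""
    groups = list(data.values())
    pooled = [y for g in groups for y in g]
    naive = math.sqrt(statistics.variance(pooled) / len(pooled))
    day_means = [statistics.fmean(g) for g in groups]
    by_day = math.sqrt(statistics.variance(day_means) / len(groups))
    return naive, by_day


if __name__ == "__main__":
    tle = simulate_tle()
    print(variance_components(tle))
    print(grand_mean_standard_errors(tle))
```

When the day-level variance is large relative to the residual variance, the naive standard error is typically much smaller than the one based on day means, so an analysis that ignores day would overstate how precisely the average TLE is known. For balanced data, the day-means standard error matches the one implied by a random intercept model.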

This work was done by Rebecca Medlin, John T. Haman, Matthew R. Avery, and Heather Wojton for the Institute for Defense Analyses. For more information, download the Technical Support Package (free white paper) below. IDA-0002
