Maximum Likelihood Estimation

(Updated April 10, 2018)

Today, let's take some time to talk about Maximum Likelihood Estimation (MLE), which is the default estimation procedure in AMOS and is considered the standard for the field. In my view, MLE is not as intuitively graspable as Ordinary Least Squares (OLS) estimation, which simply seeks to locate the best-fitting line in a scatter plot of data so that the line passes as close as possible to the data points. In other words, OLS minimizes the sum of the squared deviations between each actual data point and the point on the best-fitting line predicted for an individual with that score on the X-axis, hence "least squares." However, Maximum Likelihood is considered to be statistically advantageous.
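For reference, here is a compact statement of what OLS minimizes for a simple two-variable regression (standard textbook notation, not tied to any particular program):

```latex
% OLS chooses the intercept b_0 and slope b_1 that minimize the sum of squared
% vertical distances between each observed y_i and the line's prediction.
\min_{b_0,\, b_1} \; \sum_{i=1}^{n} \bigl( y_i - (b_0 + b_1 x_i) \bigr)^2
```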

This website maintained by S. Purcell provides what I think is a very clear, straightforward introduction to MLE. In particular, we'll want to look at the second major heading on the page that comes up, Model-Fitting.

Purcell describes the mission of MLE as being to "find the parameter values that make the observed data most likely." Here's an analogy I came up with that fits Purcell's definition. Suppose we observed a group of people laughing uproariously (the "data"). One could then ask which generating model would make the laughter most likely: a television comedy show or a drama about someone dying of cancer?
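To make "most likely" concrete, here is a minimal Python sketch of the same logic with a made-up coin-toss example (my own illustration, not taken from the Purcell page): try out candidate parameter values and keep the one under which the observed data have the highest likelihood.

```python
import numpy as np

# Hypothetical data (not from the Purcell page): 7 heads observed in 10 tosses.
heads, tosses = 7, 10

# Candidate values for p, the probability of heads: 0.01, 0.02, ..., 0.99.
candidate_p = np.linspace(0.01, 0.99, 99)

# Likelihood of the observed data under each candidate value of p.
# (The binomial coefficient is a constant, so it doesn't affect which p wins.)
likelihood = candidate_p**heads * (1 - candidate_p)**(tosses - heads)

# The maximum likelihood estimate is the candidate that makes the data most likely.
mle = candidate_p[np.argmax(likelihood)]
print(mle)  # about 0.70, i.e., the sample proportion of heads
```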

Another site lists some of the advantages of MLE, vis-a-vis OLS.

Lindsay Reed, our former computer lab director, once loaned me a book on the history of statistics, the unusually titled The Lady Tasting Tea: How Statistics Revolutionized Science in the Twentieth Century (by David Salsburg, published in 2001).

This book discusses the many statistical contributions of Sir Ronald A. Fisher, among which is MLE. Writes Salsburg:

In spite of Fisher's ingenuity, the majority of situations presented intractable mathematics to the potential user of the MLE (p. 68).

Practically speaking, obtaining MLE solutions required repeated iterations, which was very difficult to achieve until the computer revolution. Citing the sixteenth-century mathematician Robert Recorde, Salsburg writes:

...you first guess the answer and apply it to the problem. There will be a discrepancy between the result of using this guess and the result you want. You take that discrepancy and use it to produce a better guess... For Fisher's maximum likelihood, it might take thousands or even millions of iterations before you get a good answer... What are a mere million iterations to a patient computer? (p. 70).

UPDATE I: The 2013 textbook by Texas Tech Business Administration professor Peter Westfall and Kevin Henning, Understanding Advanced Statistical Methods, includes additional description of MLE. The above-referenced Purcell page provides an example with a relatively simple equation for the likelihood function. Westfall and Henning, while providing a more mathematically intense discussion of MLE, have several good explanatory quotes:

In cases of complex advanced statistical models such as regressions, structural equation models, and neural networks, there are often dozens or perhaps even hundreds of parameters in the likelihood function (p. 317).

In practice, likelihood functions tend to be much more complicated [than the book's examples], and you won't be able to solve the calculus problem even if you excel at math. Instead, you'll have to use numerical methods, a fancy term for "letting the computer do the calculus for you." ... Numerical methods for finding MLEs work by iterative approximation. They start with an initial guess... then update the guess to some value... by climbing up the likelihood function... The iteration continues until the successive values... are so close to one another that the computer is willing to assume that the peak has been achieved. When this happens, the algorithm is said to converge (p. 325; emphasis in original).

This is what the Minimization History portion of the AMOS output refers to, along with the possible error message that one's model has failed to converge.
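As an illustration of what that iterative "climbing" looks like outside of AMOS, here is a small Python sketch using scipy's general-purpose optimizer on a toy normal-distribution model (made-up data; an SEM likelihood would be far more complicated, but the iterate-until-convergence logic is the same):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)
data = rng.normal(loc=50, scale=10, size=200)  # made-up data for illustration

# Negative log-likelihood of a normal model with unknown mean and SD.
def neg_log_likelihood(params):
    mu, sigma = params
    if sigma <= 0:
        return np.inf  # impossible SD; rank this guess as worst
    return -np.sum(norm.logpdf(data, loc=mu, scale=sigma))

# Start from a deliberately poor initial guess; the optimizer then updates the
# guess iteration by iteration until successive values are close enough that
# it declares the peak of the likelihood reached.
result = minimize(neg_log_likelihood, x0=[0.0, 1.0], method="Nelder-Mead")

print(result.x)        # estimated mean and SD, close to the sample values
print(result.success)  # True when the algorithm has converged
print(result.nit)      # number of iterations required
```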

UPDATE II: The reference given by our 2014 guest speaker on MLE is:

Ferron, J. M., & Hess, M. R. (2007). Estimation in SEM: A concrete example. Journal of Educational and Behavioral Statistics, 32, 110-120.

Free/Fixed Parameters & Model Identification

(Updated September 26, 2018)

A key concept in SEM is that of freely estimated (or free) parameters vs. fixed parameters. The term "freely estimated" refers to the program determining the value for a path or variance in accordance with the data and the mathematical estimation procedure. A freely estimated path might come out as .23 or .56 or -.33, for example. Freely estimated parameters are what we're used to thinking about. 

However, for technical reasons, we sometimes must "fix" a value, usually to 1. This means that a given path or variance will take on a value of 1 in the model, simply because we tell it to, and will not receive a significance test (there being no reason to test the null hypothesis that the value equals zero in the population). Fixed values apply only to unstandardized solutions; a value fixed to 1 will appear as 1 in an unstandardized solution, but will usually appear as something different in a standardized solution. These examples should become clearer as we work through models.

Here is an initial example with a hypothetical one-factor, three-indicator model. Without fixing the unstandardized factor loading for indicator "a" to 1, the model would be seeking to freely estimate 7 unknown parameters (counted in blue parentheses in the following photo) from only 6 known pieces of information. The model would thus be under-identified* (also referred to as "unidentified"), which metaphorically is like being in "debt." Fixing the unstandardized factor loading for "a" to 1 brings the unknowns and knowns into balance.
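For readers without the photo handy, the counting can be reproduced by hand:

```latex
% Known pieces of information: the distinct variances and covariances
% among p = 3 indicators.
\frac{p(p+1)}{2} = \frac{3(3+1)}{2} = 6

% Unknowns with nothing fixed: 3 factor loadings + 3 indicator error variances
% + 1 factor variance = 7, which exceeds the 6 knowns (under-identified).
% Fixing the loading for indicator "a" to 1 leaves 6 unknowns for 6 knowns.
```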


As an alternative to fixing one unstandardized factor loading per construct to 1, a researcher can let all of the factor loadings go free and instead fix the variance of the construct to 1. See slides 29-31 in this online slideshow.
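To make the two identification strategies concrete, here is a sketch in conventional SEM notation (λ for loadings, θ for indicator error variances, F for the factor); the symbols are mine, not the slideshow's:

```latex
% (a) Marker-indicator approach: fix one loading, leave the factor variance free.
\lambda_a = 1; \quad \lambda_b,\ \lambda_c,\ \operatorname{Var}(F),\ \theta_a,\ \theta_b,\ \theta_c
\ \text{freely estimated} \quad (6\ \text{unknowns})

% (b) Fixed-factor-variance approach: fix the factor variance, leave all loadings free.
\operatorname{Var}(F) = 1; \quad \lambda_a,\ \lambda_b,\ \lambda_c,\ \theta_a,\ \theta_b,\ \theta_c
\ \text{freely estimated} \quad (6\ \text{unknowns})
```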

There is a second reason for fixing parameters to 1. Keiley et al. (2005, in Sprenkle & Piercy, eds., Research Methods in Family Therapy) discuss the metric-setting rationale for fixing a single loading per factor to 1:

One of the problems we face in SEM is that the latent constructs are unobserved; therefore, we do not know their natural metric. One of the ways that we define the true score metric is by setting one scaling factor loading to 1.00 from each group of items (pp. 446-447).

In ONYX, it seems easiest to let all the factor loadings be freely estimated (none of them fixed to 1) and instead fix each factor's variance to 1.

---
*A simple algebra scenario provides an analogy to under-identification. One equation with two unknowns (under-identified) has no single unique solution. For example, x + y = 10 could be solved by x and y values of 9 and 1, 6 and 4, 12 and -2, 7.5 and 2.5, etc., respectively. But if we have two equations and two unknowns, such as by adding the equation, x - y = 4, then we know that x = 7 and y = 3.
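For completeness, the footnote's two-equation system can also be handed to the computer (a tiny Python sketch):

```python
import numpy as np

# The footnote's system:  x + y = 10  and  x - y = 4.
A = np.array([[1.0,  1.0],
              [1.0, -1.0]])
b = np.array([10.0, 4.0])

# With as many independent equations as unknowns, the solution is unique.
x, y = np.linalg.solve(A, b)
print(x, y)  # 7.0 3.0
```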