Authors: Bruce C. Dudek, Ph.D. and Jason Bryer,
Last updated: 2022-05-12
Maximum Likelihood Estimation (MLE) has become an indispensable procedure for estimating parameters in statistical models. A great many methods rely on it, including machine learning methods, generalized least squares, and linear mixed effects modeling. Its core principles can also serve as a helpful starting point for understanding Bayesian estimation. Our concern is the teaching of MLE to students in introductory statistics or data science classes. Rather than jump immediately to the detailed mathematical treatment (most fully done with calculus), we have attempted to find visualizations that help students grasp the concepts. Later, detailed examination of the methods can include a full treatment of the mathematical foundations, and that work will flow more smoothly if the initial conceptualization is aided by graphical methods.
In modern data science and applied statistics, many researchers probably use MLE without the full mathematical background, because efficient optimization algorithms are embedded in other application functions (e.g., generalized least squares) and the analyst never needs to develop the likelihood function directly. Our contention is that visualization techniques can build a better conceptual background. While they are not a complete replacement for the mathematical foundations, they can help analysts grasp what their methods are doing.
To accomplish this goal, the reader is expected to have a rudimentary understanding of sampling from a Bernoulli process and the binomial distribution, of normal distribution theory, and, for the second part of the tutorial, of ordinary least squares (OLS) linear regression. The document is targeted at students in an introductory statistics class in a research discipline, perhaps at the graduate level, depending on the discipline.
This document also provides R code for generating the illustrations and visualizations. Some code chunks are shown directly; others are hidden but can be expanded in the HTML version of this document.
It is important to lay out, at the outset, how we use the terms probability and likelihood. The two terms are used interchangeably in everyday language, but in data analysis and statistics their meanings are distinct. The Oxford and Merriam-Webster’s dictionaries both say that probability is a synonym for likelihood.
The Oxford Reference Online Site has a more nuanced definition of likelihood that mixes elements of the two concepts that we will distinguish below: “The probability of getting a set of observations given the parameters of a model fitted to those observations. It enables the estimation of unknown parameters based on known outcomes.”
We can examine this common usage with an illustration using jelly beans of different colors, a popular illustration in introductory statistics classes. Let’s assume that we have a single container holding a very large number of jelly beans that are thoroughly and randomly mixed. Further specify that 65% are Berry Blue (BB) flavor and 35% are Cotton Candy (CC) flavor (both from the Jelly Belly brand). We ask about the possible outcomes if one candy is randomly taken out of the container. (For later illustrations, assume that we always put any candies back after removing them, thus sampling with replacement.) This is sampling from a stationary Bernoulli process: the probability of pulling a BB is .65 (let’s call this ‘p’) and the probability of pulling a CC is .35 (let’s call this ‘q’). But we also phrase these probabilities as likelihoods at times: “The likelihood of getting a BB is .65”. This is perfectly fine according to the OED and Merriam-Webster’s.
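The jelly bean setup can be mimicked with a few lines of R. This is only a minimal sketch, not code from the original document: the object names (`p`, `q`, `one_draw`, `draws`, `prop_bb`), the seed, and the number of repeated draws are all illustrative assumptions. It takes a single candy from the container and then repeats the draw many times with replacement, to show that the long-run proportion of BB candies settles near p.

```r
# Illustrative sketch: object names, seed, and n_draws are assumptions,
# not part of the original tutorial.
set.seed(1)                 # arbitrary seed so the sketch is reproducible

p <- 0.65                   # probability of drawing a Berry Blue (BB)
q <- 1 - p                  # probability of drawing a Cotton Candy (CC)

# One draw from the container (a single Bernoulli trial)
one_draw <- sample(c("BB", "CC"), size = 1, prob = c(p, q))
one_draw

# Many draws with replacement; each draw is an independent Bernoulli trial
n_draws <- 10000
draws <- sample(c("BB", "CC"), size = n_draws, replace = TRUE, prob = c(p, q))

# Observed proportion of BB candies, which should be close to p
prop_bb <- mean(draws == "BB")
prop_bb
```

Because the container is treated as very large and each candy is returned, `p` stays fixed from draw to draw, which is what "stationary" means here.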