Authors: Bruce C. Dudek, Ph.D. and Jason Bryer, Ph.D.
Last updated: 2022-05-12

Goals

Maximum Likelihood Estimation (MLE) has become an indispensable procedure for estimating parameters in statistical models. A great many methods rely on it, including machine learning, generalized least squares, and linear mixed effects modeling. Its core principles can also serve as a helpful starting point for understanding Bayesian estimation. Our concern is the instruction of MLE to students in introductory statistics or data science classes. Rather than jumping immediately to the detailed mathematical treatment (most fully done with calculus), we have attempted to find visualizations that help students grasp the concepts. A later, detailed examination of the methods can include a full treatment of the mathematical foundations, but that work will flow more smoothly if the initial conceptualization is aided by graphical methods. In modern data science and applied statistics, researchers often employ MLE through well-developed optimization algorithms without detailed knowledge of their structure: the efficient optimizers are embedded in other application functions (e.g., generalized least squares), so the analyst never has to develop the likelihood function directly. Our contention is that a better conceptual background can be built with visualization techniques. While not a complete replacement for the mathematical foundations, these methods can better enable analysts to grasp what their tools are doing.

In order to accomplish this goal, the reader is expected to have a rudimentary understanding of sampling from a Bernoulli process and the binomial distribution, of normal distribution theory, and, for the second part of the tutorial, of ordinary least squares linear regression. The document is targeted at students in an introductory statistics class in a research discipline - perhaps at the graduate level, depending on the discipline.

This document will also provide R code for generating the illustrations and visualizations. Sometimes those code chunks are directly shown and sometimes they are hidden, but can be expanded in the html version of this document.

Probability versus Likelihood

It is important to lay out definitional usages of the terms probability and likelihood at the outset. These terms are interchangeable in common language usage, but in data analysis and statistics their meanings can be distinct. The Oxford and Merriam-Webster dictionaries both list probability as a synonym for likelihood.

The Oxford Reference Online Site has a more nuanced definition of likelihood that mixes elements of the two concepts that we will distinguish below: “The probability of getting a set of observations given the parameters of a model fitted to those observations. It enables the estimation of unknown parameters based on known outcomes.”

We can examine this common usage with an illustration using jelly beans of different colors - a popular illustration in introductory statistics classes. Let's assume that we have a single container with a very large number of jelly beans and that they are thoroughly/randomly mixed. Further specify that 65% are Berry Blue flavor and 35% are Cotton Candy flavor (the Jelly Belly brand). We ask about the possible outcomes if one candy is randomly taken out of the container. (For later illustrations, assume that we always put candies back after removing them, thus sampling with replacement.) This is sampling from a stationary Bernoulli process: the probability of pulling a BB is .65 (let's call this 'p') and the probability of pulling a CC is .35 (let's call this 'q'). But we also phrase these probabilities as likelihoods at times: "The likelihood of getting a BB is .65." This is perfectly fine according to the OED and Merriam-Webster's.

Jelly Beans

Further imagine that we randomly draw 12 candies. We might ask the interchangeable questions:

What is the probability of obtaining 9 BB and 3 CC? What is the likelihood of obtaining 9 BB and 3 CC?

Instruction on this binomial process will often use either of these phrasings, and neither is necessarily inappropriate. So where does the distinction come in? Many textbooks, blogs, etc., try to address this question, and they tend to jump into detailed, varied, and sometimes elegant descriptions. Wikipedia is a good example.

We need a simpler explanation. It begins with a grammatical reorientation. In the uses above, probability and likelihood were used as nouns. The difference in their statistical usage begins when they are used as adjectives in front of "function" or "distribution". From that point of view, our task is simple: in statistical analysis, what is the difference between a probability distribution and a likelihood distribution?

A brief formal description would say that a probability distribution gives the probabilities of what WILL happen to data if a specified model controls the sampling (e.g., the binomial colors of jelly beans when 'p' and 'q' are specified). A likelihood distribution describes the same kind of probabilities/densities when an observation (the data) is fixed but the parameter(s) are allowed to vary. Both involve probabilities calculated with the same function rule. Still confused? That is why the following visualizations are presented.

Establish a Probability Distribution Concept with the Binomial

Let's reinforce the previous section's distinction: a probability distribution is the distribution of probabilities associated with the possible outcomes of data when a model parameter is specified (p/q). A likelihood distribution is the distribution of similarly derived probabilities for observed data, but based on varying models (different p and q values in our binomial illustration).
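As a minimal numerical sketch of the first half of this distinction: for one fixed p, the probabilities of all possible outcomes sum to 1.

sum(dbinom(0:12, size = 12, prob = 0.65))   # 1 (to within floating-point error)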

Now let's look at some details. With our bowl of jelly beans, we asked "What is the probability of drawing out 9 BB and 3 CC?" We find the answer with the well-known binomial probability rule:

\(\left( \begin{matrix} n \\ r \\ \end{matrix} \right){{p}^{r}}{{q}^{n-r}}\)

for our illustration: \(\left( \begin{matrix} 12 \\ 9 \\ \end{matrix} \right)*{{.65}^{9}}*{{.35}^{3}}\,\,=\,\,.1953651\)
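Evaluating this expression directly in R (a quick arithmetic check) gives the same value:

choose(12, 9) * 0.65^9 * 0.35^3   # 0.1953651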

We can also find this probability using the dbinom function in R:

dbinom(9,12,.65)
## [1] 0.1953651

For discrete random variables such as the binomial, we can use the terms probability and density interchangeably. This is the origin of the “d” prefix in the dbinom function name.
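As an aside, the same prefix convention runs through base R's distribution functions; a minimal sketch for the binomial:

dbinom(9, size = 12, prob = 0.65)    # "d": probability (mass/density) of exactly 9 successes
pbinom(9, size = 12, prob = 0.65)    # "p": cumulative probability of 9 or fewer successes
qbinom(0.5, size = 12, prob = 0.65)  # "q": quantile, here the median number of successes
rbinom(5, size = 12, prob = 0.65)    # "r": five random draws from this binomial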

With the generic language approach, this probability is also sometimes called a likelihood: "What is the likelihood of finding 9 BB and 3 CC in a sample of 12?" That labeling is not problematic in itself, but it becomes confusing when likelihood is used as an adjective, as in a likelihood distribution (we'll get to that below).

Next, we move to a visualization of a binomial probability distribution based on the jelly bean illustration above, where 12 candies were drawn from a container in which 65% were BB and 35% were CC. The figure displays probabilities for all possible outcomes of sampling 12 candies from the jar (with replacement). This is the binomial probability mass distribution.

#setup values
n <- 12
xa <- 9
xb <- 6
propa <- xa/n
propb <- xb/n
p <- .65
probxa <- dbinom(xa,n,p)
probxb <- dbinom(xb,n,p)
x1 <- as.vector(seq(0,n, 1))
density1 <- dbinom(0:n,size=n, prob=p)
df1 <- data.frame(cbind(x1,density1))
ylimits <- c(0,(max(df1$density1)) + (max(df1$density1*.1)))
p1basetitle <- paste0(" Binomial Distribution (n=",n,", p=",p,")")
p1atitle <- paste0(" Binomial prob(x=",xa,", n=",n,", p=",p,")")
p1btitle <- paste0(" Binomial prob(x=",xb,", n=",n,", p=",p,")")
p1aprobtext <-paste0("prob=",round(probxa,3))
p1bprobtext <-paste0("prob=",round(probxb,3))
p1base <- ggplot(data=df1, aes(x=x1,y=density1)) +
  scale_x_continuous(breaks=seq(0,n,1)) +
  scale_y_continuous(limits=ylimits, expand=c(0,0)) +
  theme(aspect.ratio=3/4) +
  theme_classic() +
  labs(title=p1basetitle,
       x="Number of Successes (x)",
       y="probability (density)") +
  geom_segment(
    aes(xend = x1, yend = 0),
    size = 1.6,
    lineend = "butt",
    col = "gray"
  ) +
  geom_point(
    size = 4,
    col = "black",
    fill = "gray",
    shape = 21
  )
p1base

Next, let's emphasize the outcome described above: drawing nine Berry Blue jelly beans when we sample 12 times. Note that in the common language usage of the term we might also phrase the same question as the likelihood of drawing nine Berry Blue candies. Notice the direction of the arrows - we find the probability of a certain outcome that "could" occur if the p=.65 model were controlling.

p1a <- ggplot(data=df1, aes(x=x1,y=density1)) +
        scale_x_continuous(breaks=seq(0,n,1)) +
        scale_y_continuous(limits=ylimits, expand=c(0,0)) +
        theme(aspect.ratio=3/4) +
        theme_classic() +
        labs(title=p1atitle,
             x="Number of Successes (x)",
             y="probability (density)")
#p1a
p2a <- p1a +
  geom_segment(
    aes(xend = x1, yend = 0),
    size = 1.6,
    lineend = "butt",
    col = "gray"
  ) +
  geom_point(
    size = 4,
    col = "black",
    fill = "gray",
    shape = 21
  ) +
  geom_point(aes(x = xa, y = dbinom(xa, n, p)),
             shape = 21,
             color = "steelblue3",
             size = 4) +
  geom_segment(
    x = xa,
    y = 0,
    xend = xa,
    yend = dbinom(xa, n, p),
    col = "steelblue3",
    size = 1,
    arrow = arrow(length = unit(0.5, "cm"))
  ) +
  geom_segment(
    x = xa,
    y = dbinom(xa, n, p),
    xend = -.5,
    yend = dbinom(xa, n, p),
    col = "steelblue3",
    size = 1,
    arrow = arrow(length = unit(0.5, "cm"))
    ) +
  annotate("text", x = 1.5, y = .21, label = p1aprobtext,col="steelblue3", size=5)
p2a

Or we might ask the probability of drawing six Berry Blue candies in a sample of 12:

p1b <- ggplot(data=df1, aes(x=x1,y=density1)) +
  scale_x_continuous(breaks=seq(0,n,1)) +
  scale_y_continuous(limits=ylimits, expand=c(0,0)) +
  theme(aspect.ratio=3/4) +
  theme_classic() +
  labs(title=p1btitle,
       x="Number of Successes (x)",
       y="probability (density)")
#p1b
p2b <- p1b +
  geom_segment(
    aes(xend = x1, yend = 0),
    size = 1.6,
    lineend = "butt",
    col = "gray"
  ) +
  geom_point(
    size = 4,
    col = "black",
    fill = "gray",
    shape = 21
  ) +
  geom_point(aes(x = xb, y = dbinom(xb, n, p)),
             shape = 21,
             color = "steelblue3",
             size = 4) +
  geom_segment(
    x = xb,
    y = 0,
    xend = xb,
    yend = dbinom(xb, n, p),
    col = "steelblue3",
    size = 1,
    arrow = arrow(length = unit(0.5, "cm"))
  ) +
  geom_segment(
    x = xb,
    y = dbinom(xb, n, p),
    xend = -.5,
    yend = dbinom(xb, n, p),
    col = "steelblue3",
    size = 1,
    arrow = arrow(length = unit(0.5, "cm"))
  ) +
  annotate("text", x = 1.5, y = .15, label = p1bprobtext,col="steelblue3", size=5)
p2b

In each of the cases just examined, we have specified a model (n=12 and p=.65) and then asked the probability of specific data outcomes happening if that model were controlling the sampling process:

prob(data|model).

Note that the most “likely” outcome under this model is 8 successes. Also note that the expected number of Blue Berries would be 7.8 (np = 12*.65), a value that cannot occur, but is nevertheless the expected value of this binomial distribution.

These are comfortable concepts for most who have worked with binomial distributions. We have been examining and using a Probability Distribution, even though specific outcomes are sometimes said to have “likelihoods”. This intermingling of the two words is comfortable because we are using them as interchangeable nouns. Now we can proceed to distinguish this probability DISTRIBUTION from a Likelihood DISTRIBUTION.

Visualize a Likelihood Distribution based on the binomial.

The conceptualization of a likelihood distribution involves turning the question around. Rather than asking prob(data|model), we will ask the probabilities of a specific outcome (e.g., 9 BB candies) under various models (changing “p”). The perspective here is that we still do the same sampling procedure, but we don’t know the probabilities of BB and CC ahead of time. Our goal is to estimate those parameters based on the data we collect. But we will use the same binomial probability approach to finding these probabilities (e.g., dbinom). We will specify the one outcome (9 BB candies and 3 CC candies) and examine how likely that outcome is under various models (various values of “p”).

First, here is the full likelihood distribution. Note that it is continuous and bounded by the range over which "p" can vary. The Y axis is still density/probability, but the area under the curve does not sum to 1.0, unlike the probabilities of the distribution above, which do.
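A quick numerical check of that last point (a sketch that approximates the area with a simple Riemann sum over a grid of p values):

# area under the likelihood curve for x = 9, approximated on a .01 grid of p values
sum(dbinom(9, size = 12, prob = seq(0, 1, .01))) * .01   # about 0.077 (in fact 1/(n+1) = 1/13), not 1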

pv <- as.vector(seq(0,1,.01))
xl <- 9
x2 <- 6
prop <- xl/n
density2 <- dbinom(xl,size=n, prob=pv)
density3 <- dbinom(x2,size=n, prob=pv)
df2 <- data.frame(pv,density2,density3)
max1 <- (max(df2$density2)) + (max(df2$density2*.1))
max2 <- (max(df2$density3)) + (max(df2$density3*.1))
ylimits2 <- c(0,max(max1,max2))
p3 <- ggplot(data = df2, aes(x = pv, y = density2)) +
  geom_line() +
  scale_x_continuous(breaks = seq(0, 1, .1)) +
  # scale_y_continuous(limits=ylimits2, expand=c(0,0),
  #                   breaks=seq(0,ylimits2, by=.05)) +
  scale_y_continuous(
    limits = ylimits2,
    expand = c(0, 0),
    breaks = seq(0, ylimits2[2], ylimits2[2] / 5),
    labels = scales::number_format(accuracy = 0.001)
  ) +
          labs(title="Likelihoods for various parameter values",
             subtitle="Based on a fixed number of successes, 9",
             x="Parameter Value (p)",
             y="Liklihood (Density)") +
  theme_classic() 
p3

In order to grasp where this curve came from, we can begin by locating the probability that we found for nine successes in 12 draws in the illustration above. That probability was .195 when p=.65. But notice from this likelihood distribution plot that the probability of finding nine successes varies a good bit depending on the model - the value of "p".

We can now see that we simply recalculate the binomial probability of nine successes under various models where “p” ranges from 0 to 1.0.
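A few spot checks make that point numerically:

dbinom(9, 12, 0.50)   # about 0.054
dbinom(9, 12, 0.65)   # 0.195, the value highlighted above
dbinom(9, 12, 0.75)   # about 0.258, the largest of these four
dbinom(9, 12, 0.90)   # about 0.085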

So, a likelihood distribution shows the probabilities of a specified data outcome under various models. Rather than examining all possible outcomes (x varying from 0 to 12 BB candies) under one model (p fixed at .65, as above), we are now assessing the probability of one outcome (x=9) under various models. The parameter value is now the quantity that varies, but this likelihood distribution still uses binomial probabilities/densities. Whether the parameter is a true "random" variable becomes a question of perspective that is not unconnected to Bayesian ideas.

p4 <- p3 + theme(aspect.ratio=3/4) +
        labs(title="Likelihoods for various parameter values",
             subtitle="Highlighted for the initial p value specified in the earlier example") +
        geom_segment(x=p, y=0,xend=p, yend=dbinom(xl,n,p),col="steelblue3",
                     size = 1.6, 
                     lineend = "butt") +
        geom_point(aes(x=p, y=dbinom(xl,n,p)), shape=21, fill="steelblue3", size=4) +
        annotate("text", x = .38, y = .2, label = p1aprobtext,col="steelblue3", size=5)
p4

Now let’s consider what the likelihood outcomes look like if the data were different. The plots just above were based on the observed data of 9 successes (BB) in 12. What if there had been only six? Then a different likelihood distribution would emerge.

sixdistrib <- paste0("Six Successes")
ninedistrib <- paste0("Nine Successes")

p4b <- p3 + theme(aspect.ratio = 3 / 4) +
  labs(
    title = "Likelihoods for various parameter values",
    subtitle = "Two Likelihood distributions for the two different data outcomes\n one for X=9, and one for X=6",
    x = "Parameter Value (p)",
    y = "Liklihood (Density)"
  ) +
  # geom_segment(x=p, y=0,xend=p, yend=dbinom(xl,n,p),col="steelblue3",
  #              size = 1.6,
  #              lineend = "butt") +
  # geom_point(aes(x=p, y=dbinom(xl,n,p)), shape=21, fill="steelblue3", size=4) +
  # annotate("text", x = .38, y = .2, label = p1aprobtext,col="steelblue3", size=5) +
  geom_line(aes(x = pv, y = density3)) +
  annotate("text", x = .25, y = .2, label = sixdistrib,col="steelblue3", size=5) +
  annotate("text", x = .54, y = .26, label = ninedistrib,col="steelblue3", size=5) 
p4b

Dynamically/Interactively visualize the probability and likelihood distributions

A Shiny app permits visualizing the interrelationship between the binomial probability mass function/distribution and the likelihood distribution. Use the slider and drop-down menu to change features.

If the shiny app does not show within this document, use the following link:

https://bcdudek.net/binom_likelihood

Moving toward finding “Maximum” Likelihood

At this point, it is useful to identify the peak value in one of the likelihood distributions, since that is the goal of "maximum" likelihood estimation. Let's use the one based on nine successes. Notice that the arrows point in the opposite direction to those in the binomial probability distribution, because here we are interested in which p value would make our observation of nine in 12 the most likely.

Unsurprisingly this value of p is .75, the sample proportion of nine in 12.

maxdensity <- max(df2$density2)
maxprob <- df2[which(df2$density2==maxdensity),]$pv
subtitle <- paste0("Found when p = x/n, the sample proportion (",xl,"/",n," = ",prop,")")
p5 <- p3 +         theme(aspect.ratio = 3 / 4) +
  labs(
    title = "Maximum Likelihood",
    subtitle = subtitle,
    x = "Parameter Value (p)",
    y = "Liklihood (Density)"
  ) +
  # geom_segment(x=x/n, y=0,xend=x/n, yend=maxdensity,col="blue",
  #              size = 1.6,
  #              lineend = "butt") +
  geom_segment(
    x = maxprob,
    yend = 0,
    xend = maxprob,
    y = maxdensity,
    col = "steelblue3",
    size = 1,
    lineend = "butt",
    arrow=arrow(length=unit(0.45,"cm"))
  ) +
  geom_point(
    aes(x = maxprob, y = maxdensity),
    shape = 21,
    fill = "gray",
    color="blue",
    size = 4
  ) +
  geom_segment(
    x = 0,
    xend = 1,
    y = maxdensity,
    yend = maxdensity,
    col = "gray"
  ) +
  geom_segment(
    x = -.5,
    yend = maxdensity,
    xend = maxprob,
    y = maxdensity,
    col = "steelblue3",
    size = 1,
    lineend = "butt",
    arrow=arrow(length=unit(0.45,"cm"))
  ) 
p5

Theory of likelihood functions.

In order to develop MLE theory, it is convenient to change the perspective slightly. Above, we posed the question of the probability of drawing 9 BB candies in a sample of 12. Let's change perspective while keeping the same core question in place. Rather than taking one sample of 12 candies (n=12), assume that we did 12 successive samples of size 1. Technically this is sampling from a Bernoulli distribution, but it can also be seen as 12 repetitions of a binomial with n=1.

Jelly Beans

Below, a numeric vector with 12 entries is created. A zero means the candy was CC and a one means it was BB. With this vector, you can see that there are 9 ones and 3 zeros, the same 9-and-3 outcome modeled above.

Recall that in this sampling situation, basic frequentist principles would say that the best estimate we have of "p", the underlying binomial probability of a success, is x/n = 9/12 = .75, the sample proportion (P). That is, Est(p) = P.

Now let’s try to converge on that value, using principles of maximum likelihood estimation. First we will create a data vector of these 12 samples of size 1. We also define an object that reflects sample size of one.

outcome <- c(0,1,1,0,1,1,1,0,1,1,1,1)
n <- 1

If we assume that the true probability of BB in the candy jar is .65, then we could calculate the probabilities of seeing each of the set of outcomes above, using dbinom. Notice that each probability will either be .65 or .35, as expected when we draw individual candies.

probs <- dbinom(outcome,size=n,prob=.65)
probs
##  [1] 0.35 0.65 0.65 0.35 0.65 0.65 0.65 0.35 0.65 0.65 0.65 0.65

We can rephrase the question again, to be “what is the probability of this set of outcomes?” The answer is that each draw is independent so the joint probability is the product of the twelve individual probabilities which is:

# `prod` computes the product of all elements of the vector
prod(probs)
## [1] 0.0008880233

Using this product will suffice for maximum likelihood estimation. But it can be argued that this isn't a complete answer, because we need to take into account the number of different orders in which the nine BB and three CC outcomes might have occurred:

choose(12,9)
## [1] 220

So, the full answer is that the probability is:

220*prod(probs)
## [1] 0.1953651

And not surprisingly, this recovers the same answer to the same question that we found with the graphical displays above. So the answer is, once again, the binomial probability. And we see that the binomial expression includes the 220 value as the binomial coefficient:

for our illustration: \(\left( \begin{matrix} 12 \\ 9 \\ \end{matrix} \right)*{{.65}^{9}}*{{.35}^{3}}\,\,=\,\,.1953651\)

We also found the same thing earlier with dbinom and repeat it here:

dbinom(9,12,.65)
## [1] 0.1953651

The distinction between including (or not) the \(\left( \begin{matrix} 12 \\ 9 \\ \end{matrix} \right)\) term is not important for the likelihood distribution, because the unordered-combinations quantity (the binomial coefficient) is a constant that can be ignored. But it is important to make the point that, from the point of view of the binomial, one sample with an n of 12 and 12 samples with an n of 1 are equivalent.
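A quick numerical check of the claim that the constant coefficient can be ignored (a minimal sketch; p.grid, with_coef, and without_coef are helper names introduced only for this check): the p value that maximizes the likelihood is the same with or without the binomial coefficient.

p.grid <- seq(0.01, 0.99, 0.01)
with_coef    <- dbinom(9, 12, p.grid)        # includes choose(12, 9)
without_coef <- p.grid^9 * (1 - p.grid)^3    # drops the constant coefficient
c(p.grid[which.max(with_coef)], p.grid[which.max(without_coef)])   # both 0.75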

What is the next step in developing the understanding? Our interest is in finding the maximum value of this probability while varying the "p" value, as we did in the graphical depiction above, where we already saw that the maximum occurs at .75.

The problem with this approach:

When the sample size is large, or the p and q values are divergent, the product of the n individual probabilities becomes extremely small. Computers begin to struggle with values that are this close to zero (numerical underflow). If this were not the case, taking that product would already be the primary strategy for "maximum likelihood". So what to do?
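A quick illustration of the numerical problem (a sketch; the numbers are chosen only for demonstration):

0.65^2000          # 0 -- the repeated product underflows to zero in double precision
2000 * log(0.65)   # about -861.6, easily represented on the log scale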

A mathematical trick is to convert the binomial probabilities to logs. On the log scale (natural logs are used), the potentially long series of products reduces to a summation: \(\log \prod_{i=1}^{n} f(x_i \mid p) = \sum_{i=1}^{n} \log f(x_i \mid p)\). We can easily do this in R by adding an argument to dbinom that returns the log of each probability:

dbinom(outcome, size=1, prob=.65, log=T)
##  [1] -1.0498221 -0.4307829 -0.4307829 -1.0498221 -0.4307829 -0.4307829
##  [7] -0.4307829 -1.0498221 -0.4307829 -0.4307829 -0.4307829 -0.4307829

And summing that vector produces our “Log-likelihood”

sum(dbinom(outcome, size=1, prob=.65, log=T))
## [1] -7.026513

Find the Log Likelihood Distribution

The next step is to define that idea in a function that can be used to vary the "p" parameter value and redraw the likelihood distribution in these log units.

logL <- function(p) sum(dbinom(outcome, size=1, p,log=TRUE))

Let's test it by applying it to our outcome vector, using a p value of .65 again. It returns the same value as the direct calculation above. This is good, because now we can use the function to vary "p" and produce a "log-likelihood" distribution.

logL(.65)
## [1] -7.026513

To draw a graph covering p values from 0 to 1, we first develop the sequence of X axis values in this range. Then we pass that vector of "p" values as the x axis variable and apply the logL function to each of them to obtain the y axis values used in the plotting code.

p.seq <- seq(.01, 0.99, 0.01)
logL1 <- sapply(p.seq, logL)
logLdf1 <- data.frame(p.seq, logL1)
ggplot(data=logLdf1, aes(x=p.seq, y=logL1)) +
         geom_line() +
    labs(
    title = "Likelihoods for various parameter values",
    subtitle = "Based on binomial: number of successes=9",
    x = "Parameter Value (p)",
    y = "Log Liklihood"
  ) 

If we save the values produced by applying this function to the range of p values, then we can examine the results and look for the parameter value that has the largest likelihood.

vals1 <- as.data.frame(cbind(p.seq ,sapply(p.seq, logL)))
names(vals1) <- c("p","LogL")
headTail(vals1)
##        p   LogL
## 1   0.01 -41.48
## 2   0.02 -35.27
## 3   0.03 -31.65
## 4   0.04 -29.09
## ...  ...    ...
## 96  0.96 -10.02
## 97  0.97 -10.79
## 98  0.98 -11.92
## 99  0.99 -13.91

Now we can sort this matrix by the likelihood value in the second column, in descending manner. This will place the case with the largest likelihood as the first row in the new data frame.

We will use tidyverse methods for this, and we see that the parameter value associated with the largest log likelihood is the expected .75! This jibes with the visual impression from the graph.

vals1a <- dplyr::arrange(vals1, -LogL)
head(vals1a)
##      p      LogL
## 1 0.75 -6.748022
## 2 0.74 -6.751167
## 3 0.76 -6.751281
## 4 0.73 -6.760397
## 5 0.77 -6.761311
## 6 0.72 -6.775434

And now we can visualize this maximum likelihood on the previous plot:

maxloglike1 <- vals1a[1,2]
maxp1 <- vals1a[1,1]
ggplot(data=logLdf1, aes(x=p.seq, y=logL1)) +
         geom_line() +
    geom_segment(
    x = -.5,
    xend = maxp1,
    y = maxloglike1,
    yend = maxloglike1,
    col = "steelblue3", 
    size=1.4,
    lineend = "butt",
    arrow=arrow(length=unit(0.45,"cm"))
  ) +
  geom_segment(
    x = maxp1,
    xend = maxp1,
    y = maxloglike1,
    yend = -42,
    col = "steelblue3",
    size = 1.4,
    lineend = "butt",
    arrow=arrow(length=unit(0.45,"cm"))
  ) +
    labs(
    title = "Log Likelihoods for various parameter values",
    subtitle = "Based on binomial: number of successes=9, N=12",
    x = "Parameter Value (p)",
    y = "Log Liklihood"
  ) 

Minimize the negative log likelihood

Software algorithms that perform maximum likelihood estimation are typically structured to find the smallest rather than the largest value of a function. To accommodate that, we can simply take the negative of the log likelihood that we constructed above and look for its minimum value.

First, we create the likelihood function, differing from logL only by the sign in front of the sum:

nlogL <- function(p) -sum(dbinom(outcome, size=1, p,log=TRUE))

We can find the likelihood when p=.65. It is the same value but opposite sign.

nlogL(.65)
## [1] 7.026513

And the plot suggests that the minimum of the negative log likelihood is in the same .75 region.

p.seq <- seq(.01, 0.99, 0.01)
nlogL1 <- sapply(p.seq, nlogL)
vals2 <- data.frame(p.seq ,sapply(p.seq, nlogL))
names(vals2) <- c("p", "nlogL")
vals2a <- dplyr::arrange(vals2, nlogL)
minnloglike1 <- vals2a[1,2]
minnp1 <- vals2a[1,1]
ggplot(data=vals2, aes(x=p.seq, y=nlogL)) +
  ylim(5,45) +
         geom_line() +
    geom_segment(
    x = -.5,
    xend = minnp1,
    y = minnloglike1,
    yend = minnloglike1,
    col = "steelblue3", 
    size=1.4,
    lineend = "butt",
    arrow=arrow(length=unit(0.3,"cm"))
  ) +
  geom_segment(
    x = minnp1,
    xend = minnp1,
    y = minnloglike1,
    yend = 3.5,
    col = "steelblue3",
    size = 1.4,
    lineend = "butt",
    arrow=arrow(length=unit(0.3,"cm"))
  ) +
    labs(
    title = "Negative Log Likelihoods for various parameter values",
    subtitle = "Based on binomial: number of successes=9, N=12",
    x = "Parameter Value (p)",
    y = "Negative Log Liklihood"
  ) 

Once again, if the values of all of the negative log likelihoods are placed in a data frame along with the sequence of "p" values, then we will be able to work with them to find the smallest nlogL.

vals2 <- data.frame(p.seq ,sapply(p.seq, nlogL))
names(vals2) <- c("p", "nlogL")
headTail(vals2)
##        p nlogL
## 1   0.01 41.48
## 2   0.02 35.27
## 3   0.03 31.65
## 4   0.04 29.09
## ...  ...   ...
## 96  0.96 10.02
## 97  0.97 10.79
## 98  0.98 11.92
## 99  0.99 13.91

This time, we want to find the minimum negative log likelihood, so we sort in ascending order. As expected, and as seen in the plot above, the minimum negative log likelihood value occurs at a "p" value of .75.

vals2a <- dplyr::arrange(vals2, nlogL)
head(vals2a)
##      p    nlogL
## 1 0.75 6.748022
## 2 0.74 6.751167
## 3 0.76 6.751281
## 4 0.73 6.760397
## 5 0.77 6.761311
## 6 0.72 6.775434

These approaches were carried out under two limiting conditions. One is that we used a simplistic sampling situation (single candy draws, one at a time). The other is that we did all of the calculations by brute force. This works, but it is not very flexible for other situations. The next section outlines the use of optimization functions in R that decrease the workload.

Software algorithms for finding the maximum Likelihood.

There are several different ways to implement maximum likelihood estimation in R. In this section we will first show some of them, using the previous binomial illustration. Following that, we will change the perspective on the sampling slightly; that follow-up illustration will be more similar to a true data-analytic situation.

The optimize function in R finds this same answer (to within a few decimal places of precision). It returns a quantity called "maximum", which is the parameter value that maximized the likelihood function; it is very close to the .75 value that we derived above. The "objective" quantity produced by optimize is the largest log likelihood value, thus the "maximum", and it matches the one we found manually above. Note that we still have to define the log likelihood function and pass it to optimize, just as we did for our manual approach.

# optimize is set to maximize here, so we use the (positive) log likelihood
n <- 1
outcome2 <- c(0,1,1,0,1,1,1,0,1,1,1,1)
logL <- function(p) sum(dbinom(outcome2, n, p, log=T))
optimize(logL,lower=0,upper=1,maximum=TRUE)
## $maximum
## [1] 0.7499857
## 
## $objective
## [1] -6.748022

Since the maximized "p" value is not exactly on the expected .75 (although very close), the algorithm is evidently searching iteratively to a numerical tolerance rather than evaluating a fixed grid of values. But before we consider that, let's look at two other optimization functions in R. First is the mle2 function from the bbmle package. It works on the negative log likelihood function. The "start" argument sets a starting parameter value for the maximization algorithm. Once again, the likelihood is maximized at a value very near the expected .75.

nlogL <- function(p) -sum(dbinom(outcome2, size=1, p,log=TRUE))
bbmle::mle2(nlogL, start = list(p = 0.5)) 
## 
## Call:
## bbmle::mle2(minuslogl = nlogL, start = list(p = 0.5))
## 
## Coefficients:
##         p 
## 0.7499989 
## 
## Log-likelihood: -6.75

The third optimization function to use is optim. It is probably the most commonly used MLE optimization function in the R ecosystem. We will use the negative log likelihood function with optim. The "Brent" method is appropriate here because a single parameter is being optimized, as in this binomial situation. Once again we see that it converges on the expected .75 value, with the same likelihood value.

optim(par=.5, fn=nlogL,
      method="Brent",
      lower=0.001,upper=0.999)
## $par
## [1] 0.75
## 
## $value
## [1] 6.748022
## 
## $counts
## function gradient 
##       NA       NA 
## 
## $convergence
## [1] 0
## 
## $message
## NULL

Now, we can wonder how this optimization works. It is an iterative process that does not necessarily examine all possible parameter values; its efficiency comes from the fact that the algorithm takes relatively few steps to find the "maximum likelihood" solution. Let's save all of the values that it found at each of the steps it took. The optim_save function, found in the visualMLE package, provides that functionality.

set.seed(16543)
it1 <- visualMLE::optim_save(.50,nlogL,method="Brent",lower=0.001,upper=0.999)
it1_df <- it1$iterations_df
names(it1_df) <- c("p", "NegLogLikelihood", "Iteration")
gt::gt(it1_df)
p NegLogLikelihood Iteration
0.3822021 10.101034 1
0.6177979 7.219762 2
0.7634042 6.753915 3
0.7581290 6.750168 4
0.7480184 6.748147 5
0.7500616 6.748022 6
0.7499859 6.748022 7
0.7499999 6.748022 8
0.7500000 6.748022 9
0.7500000 6.748022 10
0.7500000 6.748022 11
0.7500000 6.748022 12

Visualizing this sequence may be helpful here; this kind of visualization becomes even more useful in more complex illustrations.

df1.melt <- melt(it1_df, id.var = 'Iteration')
names(df1.melt) <- c("Iteration", "Parameter", "Value")
p_it1 <- ggplot(df1.melt, aes(x = Iteration, y = Value, color = Parameter)) +
  geom_vline(aes(xintercept=Iteration), color="grey75") +
  geom_line() +
  geom_point(size=2) +
  facet_wrap(~ Parameter, scales = "free_y", ncol = 1) +
  xlab('Iteration') + ylab('Parameter Estimate') +
  scale_color_manual(values=c("steelblue2", "black"))+
  theme(legend.position = "none") +
  transition_reveal(Iteration) 
#p_it1
p_it1_gif <- animate(p_it1, width = 480, height = 480)
p_it1_gif

Using optim with a more complex binomial illustration.

Now we will change the sampling situation again. Let's keep n=12, but replicate the experiment 15 times. The image shows the results of these 15 replications. This can be envisioned as more similar to a realistic data situation than the illustration above: 15 replicates of a sample are performed, and we want to use the joint outcome as our best estimate of the parameter value.

Jelly Beans

In the data vector, each value represents the number of BB candies among the 12 that were sampled. N is 12 from the point of view of the binomial distribution, but there were 15 replications of the N=12 study. Here are the data and the (negative) likelihood function to use - note its similarity to the simpler version above where we defined n=1.

y2 <- c(6,8,7,4,9,7,2,9,5,7,8,9,3,7,10)
n <- 12
nlogL <- function(p) -sum(dbinom(y2, n, p, log=T))

The optim_save function runs the optim function and saves the values at each iteration. Here, once again, there were relatively few iterations. The algorithm converged on a maximized “p” value of about .561.

set.seed(16543)
it2 <- visualMLE::optim_save(.50,nlogL,method="Brent",lower=0.001,upper=0.999)
it2_df <- it2$iterations_df
names(it2_df) <- c("p", "NegLogLikelihood", "Iteration")
gt::gt(it2_df)
p NegLogLikelihood Iteration
0.3822021 46.41929 1
0.6177979 35.85463 2
0.7634042 52.36848 3
0.5540071 34.66807 4
0.5607169 34.64973 5
0.5610196 34.64967 6
0.5611117 34.64967 7
0.5611111 34.64967 8
0.5611111 34.64967 9
0.5611111 34.64967 10
0.5611111 34.64967 11

And the plot of this second MLE set of iterations is shown next. The first iteration began with a “p” value near .38, then it moved high for a few iterations and then moved back down to near the final converged value after only about 4 iterations.

df2.melt <- melt(it2_df, id.var = 'Iteration')
names(df2.melt) <- c("Iteration", "Parameter", "Value")
p_it2 <- ggplot(df2.melt, aes(x = Iteration, y = Value, color = Parameter)) +
  geom_vline(aes(xintercept=Iteration), color="grey75") +
  geom_line() +
  geom_point() +
  facet_wrap(~ Parameter, scales = "free_y", ncol = 1) +
  xlab('Iteration') + ylab('Parameter Estimate') +
  scale_color_manual(values=c("steelblue2", "black"))+
  theme(legend.position = "none") +
  transition_reveal(Iteration)

p_it2_gif <- animate(p_it2, width = 480, height = 480)
p_it2_gif

In order to put this in context, let's find the mean number of BB candies across the 15 runs of the n=12 samples and divide it by 12 to obtain the direct estimate of "p". Unsurprisingly, we find the same value of "p" as our best estimate of the true probability of a success (drawing a BB candy).

mean(y2/12)
## [1] 0.5611111

The reader may be wondering, at this point, why we went to all the trouble of doing maximum likelihood when we could have used the direct approach of finding the sample proportion (or average proportion) to estimate "p". The answer is that we could have, and those would be good estimates. The value of these illustrations is that, with these simple sampling situations, we could fully visualize the concepts as we developed them. More complex analyses will not be as obviously approachable with direct methods. Another useful perspective developed by these simple illustrations is that maximum likelihood methods are essentially frequentist methods.

The next section increases the complexity a bit more by assessing distributions of continuous variables rather than the discrete variable examined with the binomial.

Move to a continuous distribution based example

Goal: Best estimates of population mean (\(\mu\)) and sd (\(\sigma\)) using a sample of data and assuming normality of the distribution from which the sample is drawn.

The use of the sample mean (\(\bar{X}\)) to estimate \(\mu\) and the sample standard deviation (sd) to estimate \(\sigma\) is well understood. The sample mean is an unbiased estimate, while the sample standard deviation is biased downward to a degree that depends on sample size. We can also use maximum likelihood theory to converge on estimates of the \(\mu\) and \(\sigma\) parameters. The value of this illustration lies in moving from a discrete distribution to a continuous one and in simultaneously estimating two parameters from observed data.

We begin by creating a random sample of n=15 scores drawn from a normal distribution. This sample was constructed to have a mean of exactly 100 and a standard deviation of exactly 15.

set.seed(1465)
x = rnorm(15)
# rescale x to a specified mean/sd.
xdev <- x-(mean(x))
xz <- xdev/(sd(x))
x1 <- (xz*15)+100
round(x1,2)
##  [1] 106.32 108.57  93.26 107.76  99.87  88.52  91.38  93.28 111.38  93.41
## [11]  78.93 112.96 133.61 107.38  73.35

For programming convenience, we place these data in a data frame and double check that the mean and sd are the correct values.

xdf <- as.data.frame(x1)
c('mean'=mean(xdf$x1),'sd'=sd(xdf$x1)) # double check
## mean   sd 
##  100   15

A dotplot permits visualization of this sample distribution and comparison to a normal distribution with the same \(\mu\) and \(\sigma\) values as the sample estimates.

p6 <- ggplot(xdf, aes(x = x1)) +
  # geom_histogram(aes(y =..density..),
  #                color="black",
  #                fill="white") +
  xlim(c(60, 140)) +
  geom_point(y = 0, size = 1.6) +
  stat_function(fun = dnorm,
                args = list(mean = mean(xdf$x1),
                            sd = sd(xdf$x1))) +
  labs(title = "Sample data with hypothetical normal of same Mu and Sigma")
suppressMessages(print(p6))

In continuous distributions such as this Gaussian, the density concept is distinct from its identity with probability that we saw for discrete distributions such as the binomial above. Density is f(x), just the "height" of the curve at a specified value of X, and we can find such densities for any X value. We will use densities in the likelihood calculations for continuous variables, recognizing that they are distinct from the probability concept developed with the binomial, where density and probability were the same thing.

With continuous distributions we characterize probabilities as areas under specified regions of the curve. This plot helps visualize the density and the cumulative left tail probability for the third lowest X value in the simulated data:

# sort xdf to find third lowest value
xdf <- dplyr::arrange(xdf, x1 )
quantile <- xdf[3,1]
x3prob <- round(pnorm(quantile,mean=100, sd=15, lower.tail=TRUE), 3)
label1 <- paste0(" cum prob=\n", x3prob)

pn1 <- ggplot(NULL) + xlim(55, 145) +
  stat_function(
    fun = dnorm,
    geom = "area",
    fill = "steelblue",
    alpha = .3,
    args = list(mean = 100, sd = 15)
  ) +
  stat_function(
    fun = dnorm,
    geom = "area",
    fill = "steelblue",
    alpha = .3,
    args = list(mean = 100, sd = 15),
    xlim = c(0, quantile)
  ) +
  xlab('X') + ylab('Density') +
  geom_segment(data = xdf,
               aes(x = x1),
    x = quantile,
    xend =quantile,
    y = 0,
    yend =dnorm(quantile, mean = 100, sd = 15),
    col = "gray55",
    size = 1.2,
    lineend = "butt"
  ) +
  geom_segment(data = xdf,
               aes(x = x1),
               x = quantile,
               xend =52,
               y = dnorm(quantile, mean = 100, sd = 15),
               yend =dnorm(quantile, mean = 100, sd = 15),
               col = "gray55",
               size = 1.2,
               arrow = arrow(length = unit(0.5, "cm"))
  ) +
  geom_point(data = xdf,
             aes(x = x1),
             y = 0,
             size = 3)+
    annotate("text", x = 82, y = .007, 
             label = label1, 
             col="black", 
             size=4) 

pn1

And the exact values of these quantities are:

dnorm(xdf[3,1],mean=100,sd=15)
## [1] 0.01983977
pnorm(quantile,mean=100, sd=15, lower.tail=TRUE)
## [1] 0.2219547

Mimicking what was done for the binomial example above, we could obtain the densities for each of the x values if the distribution were normal with a specified set of \(\mu\) and \(\sigma\) parameters - using the dnorm function.

# pass hypothetical mu and sigma parameters to dnorm
dnorm(xdf$x1, mean=100, sd=15)
##  [1] 0.005485415 0.009918375 0.019839767 0.022549379 0.024045035 0.024060063
##  [7] 0.024148026 0.026595170 0.024334071 0.023562026 0.023261520 0.022590399
## [13] 0.019945411 0.018311133 0.002159461

Using the independence logic and taking the product of that set of densities, as we did in the binomial illustration, would produce a likelihood statistic for that sample. However, it would be a very small value, being the product of 15 small decimal quantities. This will be rectified with the same logarithm trick used above.
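To see the scale of the problem (a quick check using the xdf data frame defined above):

prod(dnorm(xdf$x1, mean = 100, sd = 15))   # about 2.1e-27, i.e., exp(-61.40483)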

Taking the product of all of the densities also does not take into account the different possible orders in which the same values might have been obtained in different samples. It turns out that this ordering adjustment does not make any difference in finding the maximum (or minimum negative) likelihood, since it is a constant, so we can ignore it.

Here are the log densities for those 15 data points

dnorm(xdf$x1, mean=100, sd=15, log=T)
##  [1] -5.205663 -4.613366 -3.920067 -3.792048 -3.727827 -3.727202 -3.723553
##  [8] -3.627026 -3.715878 -3.748119 -3.760955 -3.790230 -3.914756 -4.000246
## [15] -6.137897

Summing the 15 log densities gives our likelihood statistic for these 15 data values, based on the premise of sampling from a \(\mu=100\) and \(\sigma=15\) normal distribution.

sum(dnorm(xdf$x1, mean=100, sd=15, log=T))
## [1] -61.40483

We might compare that value to one where the hypothesized parameters are different. The more negative log likelihood implies that the data are a less likely outcome when \(\mu=130\) and \(\sigma=15\).

sum(dnorm(xdf$x1, mean=130, sd=15, log=T))
## [1] -91.40483

In order to obtain the likelihood statistic for many possible combinations of \(\mu\) and \(\sigma\) it is convenient to write a function. This function also changes the sign of the sum to the negative of the above quantity.

llik1 <- function(x,mu,sigma) {
  m=mu
  s=sigma
  #n=length(x)
  -sum(dnorm(x, mean=m, sd=s, log=T))
}

Use of that function repeats some of the manual findings from above and shows the utility for extending to more combinations of the two parameters.

llik1(xdf$x1,mu=100,sigma=15)
llik1(xdf$x1,mu=130,sigma=15)
llik1(xdf$x1,mu=135,sigma=20)
## [1] 61.40483
## [1] 91.40483
## [1] 85.62631

Visualizing the likelihood for combinations of \(\mu\) and \(\sigma\) parameters

It would still be a large amount of work to manually use the function just created in order to find likelihoods for a large number of combinations. This section automates that approach and creates a visualization of the three dimensional relationship between the two parameters and the likelihood statistic.

We begin by creating a data vector in the same manner as above, adding it to a data frame, and double checking that the mean and sd are the expected quantities.

set.seed(1456)
x = rnorm(15)
# rescale x to a specified mean/sd.
xdev <- x-(mean(x))
xz <- xdev/(sd(x))
x <- (xz*15)+100
xdf <- as.data.frame(x)
c('mean'=mean(xdf$x),'sd'=sd(xdf$x)) # double check
## mean   sd 
##  100   15

Next, we set up ranges of values for the mu and sigma parameters, and we also establish an object for sample size. These vectors contain sequences that step by .25 units, providing a reasonable approximation to continuous scales that can be visualized below.

n=length(xdf$x)
xx <- seq(85, 115, .25)
ya <- seq(8,35,.25)

But having these two ranges is not enough; we need to combine them into all possible pairs of the \(\mu\) and \(\sigma\) quantities found in those vectors. The expand.grid function in R permits this. When done, we find that there are 13189 unique combinations of \(\mu\) and \(\sigma\) in our new object. They are called "x" and "y" (for \(\mu\) and \(\sigma\)) for purposes related to the plotting method below.

grid <- as.data.frame(expand.grid(xx,ya))
names(grid) <- c("x","y")
grid$x <- as.numeric(grid$x)
grid$y <- as.numeric(grid$y)
dim(grid)[1]
## [1] 13189

Next we produce the likelihood statistic for each of the 13189 combinations of \(\mu\) and \(\sigma\) produced in our grid and add those values to the grid data frame. We can look at the first and last few rows in that data frame.

for (i in 1:dim(grid)[1]) {
  grid$z[[i]]  <- -sum(dnorm(xdf$x, mean=grid[i,1], sd=grid[i,2], log=T))
}
grid$z <- as.numeric(grid$z)
names(grid) <-c("Mu","Sigma","Likelihood")
#str(grid)
headTail(grid)
##           Mu Sigma Likelihood
## 1         85     8      95.95
## 2      85.25     8      95.08
## 3       85.5     8      94.22
## 4      85.75     8      93.38
## ...      ...   ...        ...
## 13186 114.25    35      69.64
## 13187  114.5    35      69.69
## 13188 114.75    35      69.73
## 13189    115    35      69.78

Since we are estimating two parameters, any visualization of this outcome has to be three dimensional - a surface. We have found that the visualization works best as a 3D scatterplot of the values just placed in the "grid" data frame; the impression of a 3D surface is adequate with the scatterplot. The plot can be rotated with the mouse.

scatter3D(grid$Mu, grid$Sigma, grid$Likelihood,
          phi = 0, theta=60,
          bty = "g",
          pch = 16, cex = .15, ticktype = "detailed",
          col="steelblue",
          xlab="Mu", ylab= "Sigma", zlab ="Liklihood",
          add=FALSE)
plotrgl()

The next useful step would be to find the minimum negative Likelihood value and the associated parameter estimates:

best <- grid[which(grid$Likelihood==min(grid$Likelihood)),]
best
##       Mu Sigma Likelihood
## 3207 100  14.5   61.38739

With the coarse and non-continuous dimensions for \(\mu\) and \(\sigma\) created in the grid, it is not surprising that the estimates are not both exactly on 100 and 15 - more on that is addressed below.

If we add a visual indicator of the minimum Likelihood value, we can visualize this best estimate on the “surface” and once again the plot can be rotated with the mouse.

m <- best$Mu
s <- best$Sigma
l <- best$Likelihood
scatter3D(grid$Mu, grid$Sigma, grid$Likelihood,
          phi = 0, theta=65,
          bty = "g",
          pch = 16, cex = .15, ticktype = "detailed",
          col="steelblue",
          xlab="Mu", ylab= "Sigma", zlab ="Liklihood",
          add=FALSE)
scatter3D(m,s,l, 
          phi = 0, theta=65,
          bty = "g",
          pch = 16, cex = 2, ticktype = "detailed",
          col="red",
          add=TRUE)
plotrgl()

The shape of the surface is interesting in its own right, but our purpose here is to consolidate the idea of finding the parameters that are maximally likely and to build on the foundation that we laid with the binomial visualization.

Using optim to automate the process of maximum likelihood estimation.

Next, it is useful to make the process more efficient by using the optim function in R to do the maximization. optim requires three arguments:

  1. a set of “starting” point values for the search.

  2. the function to be optimized - we defined it above as llik1, but it is recreated below with a new name.

  3. the data on which to base the optimization.

The likelihood function is the same one used above, but recreated here with a name change to remind us that it gives the negative log likelihood:

nllik1 <- function(x,par) {
  xmean=par[1]
  xstdev=par[2]
  #n=length(x)
  -sum(dnorm(x, mean=xmean, sd=xstdev, log=T))
}

The optimization function is run with arbitrary starting values of 85 and 10 for mu and sigma, and then the estimates are extracted from the resulting object and tabled along with the actual sample mean and sd.

res0 = optim(par=c(85,10), nllik1, x=xdf$x)
print(kable(
  cbind('data'=c('mean'=mean(xdf$x),'sd'=sd(xdf$x)),
        'optim'=res0$par),digits=4))
## 
## 
## |     | data|    optim|
## |:----|----:|--------:|
## |mean |  100| 100.0002|
## |sd   |   15|  14.4893|

As was done with the binomial illustration, it is possible to extract the estimated parameters (\(\mu\) and \(\sigma\)) at each iteration in the optim process and this permits graphs displaying the progress of the maximization.

# source("optim_save.R")
set.seed(1465)
norm1 <- visualMLE::optim_save(c(85,10), nllik1, x=xdf$x)
it2_dfa <- norm1$iterations_df
headTail(it2_dfa)
##     Param1 Param2 Result Iteration
## 1       85     10  80.95         1
## 2     93.5     10  67.24         2
## 3       85   18.5  67.08         3
## 4     93.5   18.5  63.08         4
## ...    ...    ...    ...       ...
## 50     100  14.49  61.39        50
## 51   99.99  14.49  61.39        51
## 52     100   14.5  61.39        52
## 53     100  14.49  61.39        53

Now we can visualize the progress of the iterations as they optimize the parameters. The panel on the left shows the original data (points) and each estimated normal distribution across the iterations. The panel on the right shows the actual \(\mu\) and \(\sigma\) values as the iterations progress and converge to the final values.

names(it2_dfa) <- c("Mu", "Sigma", "NegLogLikelihood", "Iteration")
dfa.melt <- melt(it2_dfa, id.var = 'Iteration')
names(dfa.melt) <- c("Iteration", "Parameter", "Value")

fps <- 2
scale_range <- 2 # How many standard deviations around the mean should the x-axis include. This will use the smallest and largest mu from optim, not the raw data.
max_sigma <- max(it2_dfa$Sigma)
min_mu <- min(it2_dfa$Mu)
max_mu <- max(it2_dfa$Mu)
xlim <- c(min_mu = min_mu - scale_range * max_sigma,
          max_mu = max_mu + scale_range * max_sigma)
n_iter <- nrow(it2_dfa)
n_steps <- 100

dists <- data.frame(Iteration = rep(1:n_iter, each = n_steps),
                    x = rep(seq(from = xlim[1], to = xlim[2], length = n_steps)),
                    y = NA_real_)
for(i in 1:nrow(dists)) {
    dists[i,]$y <- dnorm(dists[i,]$x,
                         mean = it2_dfa[dists[i,]$Iteration,]$Mu,
                         sd = it2_dfa[dists[i,]$Iteration,]$Sigma)
}

ylim <- c(0, max(dists$y) * 1.05)
# note that xdf is defined earlier in the document when the data were originally defined
p_it3a <- ggplot(xdf, aes(x = x)) +
    geom_polygon(data = dists, aes(x = x, y = y, group = Iteration),
                 fill = '#E69F00', alpha = 0.1) +
    # Make this a function parameter
    geom_function(fun = dnorm, args = list(mean = it2_dfa[nrow(it2_dfa),]$Mu,
                                           sd = it2_dfa[nrow(it2_dfa),]$Sigma),
                  color = 'steelblue2', alpha = 0.6) +
    geom_line(data = dists, aes(x = x, y = y, group = Iteration)) +
    geom_vline(data = it2_dfa, aes(xintercept = Mu), color = 'steelblue2') +
    geom_point(y = 0) +
    # expand_limits(y = 0) +
    ylim(ylim) + xlim(xlim) +
    ylab('') +
    theme(axis.ticks.y = element_blank(), axis.text.y = element_blank()) +
    transition_time(Iteration) +
    labs(title = "Iteration: {frame_time}")

p_it3a_gif <-
  animate(
    p_it3a,
    width = 480,
    height = 480,
    fps = fps,
    nframes = nrow(it2_dfa),
    renderer = magick_renderer()
  )


p_it3b <- ggplot(dfa.melt, aes(x = Iteration, y = Value, color = Parameter)) +
    geom_vline(aes(xintercept=Iteration), color="grey55") +
    geom_line()+
    facet_wrap(~ Parameter, scales = "free_y", ncol = 1) +
    xlab('Iteration') + ylab('Parameter Estimate') +
    scale_color_manual(values=c("steelblue2", "#E69F00", "black")) +
    geom_point(size=2) +
    theme(legend.position = 'none') +
    transition_reveal(Iteration)  
p_it3b_gif <- animate(p_it3b, width = 480, height = 480, fps = fps, nframes = nrow(it2_dfa), renderer = magick_renderer())

new_gif <- magick::image_append(c(p_it3a_gif[1], p_it3b_gif[1]))
for(i in 2:nrow(it2_dfa)){
    combined <- magick::image_append(c(p_it3a_gif[i], p_it3b_gif[i]))
    new_gif <- c(new_gif, combined)
}
new_gif

Accuracy of Maximum Likelihood Estimates

With the binomial illustration, we saw that the optimization algorithms converged precisely on the values expected for that simplistic example. The maximum likelihood estimate of the parameter was always equivalent to the sample proportion (#successes/N), which is the best frequentist estimate. This is one reason that the binomial illustration works well for developing maximum likelihood ideas: the outcome of the ML estimation is direct and intuitive.

For the illustration of estimating \(\mu\) and \(\sigma\) for a normal distribution from sample data, the optim example produced an estimate of the mean that was within a few decimal places of the expected value of 100; the precision was good. But for the standard deviation, optim underestimated the true value by a small but clear amount. It turns out that this is expected behavior of the maximum likelihood method. The distinction is the familiar N vs. N-1 issue in computing sample variances as biased or unbiased estimates (respectively) of the population variance: the ML estimator of the variance is \(\hat{\sigma}^{2}_{ML}=\frac{1}{N}\sum_{i=1}^{N}(X_i-\bar{X})^{2}\), and multiplying it by N/(N-1) gives the unbiased estimator. The correction is therefore a simple N/(N-1) ratio. If we take the optim-derived estimate of 14.4893 and square it to produce a variance, we can apply this adjustment:

(14.4893^2)*(15/(15-1))
## [1] 224.9355

And the square root of that is now very close to the expected sigma value of 15 for our example:

sqrt(224.9355)
## [1] 14.99785

Most ML methods have good accuracy. In situations where they do not, e.g., variance estimation in linear mixed effects modeling, an alternative called Restricted Maximum Likelihood (REML) estimation is preferred.

Conclusions

It is possible to introduce the key maximum likelihood concepts without a detailed mathematical treatment and without using calculus to "find" the maximum likelihood for a parameter (or parameters). While the reader does need to track the framework of the approach with R code, the visualizations found herein provide a way to master the general ideas in two simple situations, one with a discrete RV and the other with a continuous one.

The largest conceptual and programming challenge for the novice is probably the creation of the likelihood functions in R. This also raises a concern the reader may have developed: would one always need to create the likelihood function for an application in order to use maximum likelihood methods in R, given that the optim function requires one? The general answer is: not usually. These simplistic illustrations with the binomial and with normal mean/sd estimation are not the kind of applications where maximum likelihood methods are most commonly used, e.g., generalized least squares. In those methods, the optimization algorithms already have the likelihood functions incorporated, so the user does not have to define them.

A good next step would be to work through the application of a similar set of illustrations using ordinary least squares regression, in the accompanying tutorial on that topic (the J. Bryer tutorial cited below).

References

Fox, J. (2016). Applied regression analysis and generalized linear models (Third Edition ed.). SAGE.

Millar, R. B. (2011). Maximum likelihood estimation and inference : with examples in R, SAS, and ADMB. Wiley.

Myung, I. J. (2003). Tutorial on maximum likelihood estimation. Journal of Mathematical Psychology, 47(1), 90-100. https://doi.org/10.1016/s0022-2496(02)00028-7

Ward, M. D., & Ahlquist, J. S. (2018). Maximum likelihood for social science : strategies for analysis (1 Edition. ed.). Cambridge University Press.

A variety of online resources facilitated the creation of this tutorial. Several of our R code approaches were modeled on presentations found in these links, especially the Steele document.

Wikipedia: Maximum Likelihood Estimation

Wikipedia: Likelihood function

Bryer, J. Visual Introduction to Maximum Likelihood Estimation

Steele

Brooks-Bartlett

Coghlan

Gallistel

Huang

May

Shoemaker

Singla

Cross Validated

Cross Validated

OpenLearn: Modelling and Estimation

Statquest on YouTube

With the calculus:

Lindsey

Shiny app at Cal Poly:

Shiny Binomial

Reproducibility

Version 0.2. May 12, 2022

sessionInfo()
## R version 4.1.2 (2021-11-01)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18363)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] transformr_0.1.3 magick_2.7.3     visualMLE_0.9    shiny_1.7.1     
##  [5] gganimate_1.0.7  reshape2_1.4.4   psych_2.1.9      plot3Drgl_1.0.2 
##  [9] rgl_0.108.3      plot3D_1.4       dplyr_1.0.7      knitr_1.37      
## [13] bbmle_1.0.24     ggthemes_4.2.4   ggplot2_3.3.5   
## 
## loaded via a namespace (and not attached):
##   [1] colorspace_2.0-2    ellipsis_0.3.2      class_7.3-19       
##   [4] fs_1.5.2            proxy_0.4-26        farver_2.1.0       
##   [7] DT_0.20             fansi_1.0.0         mvtnorm_1.1-3      
##  [10] lubridate_1.8.0     xml2_1.3.3          extrafont_0.17     
##  [13] mnormt_2.0.2        cachem_1.0.6        jsonlite_1.7.2     
##  [16] gt_0.3.1            broom_0.7.11        Rttf2pt1_1.3.9     
##  [19] dbplyr_2.1.1        readr_2.1.1         compiler_4.1.2     
##  [22] httr_1.4.2          backports_1.4.1     assertthat_0.2.1   
##  [25] Matrix_1.4-0        fastmap_1.1.0       later_1.3.0        
##  [28] tweenr_1.0.2        htmltools_0.5.2     prettyunits_1.1.1  
##  [31] tools_4.1.2         misc3d_0.9-1        gtable_0.3.0       
##  [34] glue_1.6.0          Rcpp_1.0.7          cellranger_1.1.0   
##  [37] jquerylib_0.1.4     pkgdown_2.0.1       vctrs_0.3.8        
##  [40] nlme_3.1-153        extrafontdb_1.0     xfun_0.29          
##  [43] stringr_1.4.0       rvest_1.0.2         mime_0.12          
##  [46] lpSolve_5.6.15      lifecycle_1.0.1     MASS_7.3-54        
##  [49] scales_1.1.1        hms_1.1.1           promises_1.2.0.1   
##  [52] parallel_4.1.2      tidyverse_1.3.1     yaml_2.2.1         
##  [55] memoise_2.0.1       sass_0.4.0          bdsmatrix_1.3-4    
##  [58] stringi_1.7.6       highr_0.9           checkmate_2.0.0    
##  [61] e1071_1.7-9         gifski_1.4.3-1      rlang_0.4.12       
##  [64] pkgconfig_2.0.3     evaluate_0.14       lattice_0.20-45    
##  [67] purrr_0.3.4         sf_1.0-5            htmlwidgets_1.5.4  
##  [70] labeling_0.4.2      cowplot_1.1.1       tidyselect_1.1.1   
##  [73] plyr_1.8.6          magrittr_2.0.1      R6_2.5.1           
##  [76] generics_0.1.1      DBI_1.1.2           pillar_1.6.4       
##  [79] haven_2.4.3         withr_2.4.3         units_0.7-2        
##  [82] tibble_3.1.6        modelr_0.1.8        crayon_1.4.2       
##  [85] KernSmooth_2.23-20  utf8_1.2.2          tmvnsim_1.0-2      
##  [88] tzdb_0.2.0          rmarkdown_2.11      progress_1.2.2     
##  [91] grid_4.1.2          readxl_1.3.1        forcats_0.5.1      
##  [94] reprex_2.0.1        digest_0.6.29       classInt_0.4-3     
##  [97] xtable_1.8-4        tidyr_1.1.4         httpuv_1.6.5       
## [100] numDeriv_2016.8-1.1 munsell_0.5.0       bslib_0.3.1        
## [103] tcltk_4.1.2