Summary: Maximum Likelihood Estimation, Maximum a Posteriori Estimation, MLE, MAP

Maximum Likelihood Estimation (MLE)

Maximum Likelihood Estimation (MLE) is a principle for estimating the parameters of a statistical model: choose the parameter values that make the observed data most probable. In other words, MLE maximizes the likelihood of the data.

Parameter Estimation: Estimating the Probability of Heads

Let's assume we have a random variable $X$ representing a coin flip, which turns up heads ($X = 1$) or tails ($X = 0$). We want to estimate the probability that it turns up heads.

Task: Estimate the probability of heads $\theta = P(X = 1)$

Evidently, if $P(X=1)=\theta$, then $P(X=0)=1-\theta$. Since we do not know the "true" probability of heads, i.e. $P(X=1) = \theta$, we will use $\hat\theta$ to refer to its estimate.

Question: How should we estimate $\theta = P(X=1)$?

In general, the Maximum Likelihood Estimation principle says to choose the parameter $\theta$ that maximizes $P(Data|\theta)$, in other words the probability of the observed data. We assume that $\theta$ belongs to a set $\Theta \subset \mathbb{R}^n$. Therefore,

$$\hat\theta_{MLE} = \underset{\theta}{\arg\max}\ P(Data|\theta)$$

In regards to our coin flip example, if we flip the coin repeatedly, we observe that:

  • It turns up heads $\alpha_1$ times
  • It turns up tails $\alpha_0$ times

Intuitively, we can estimate $P(X=1)$ from our training data (the observed tosses) as the fraction of flips that end up heads:

$$ P(X=1) = \frac{\alpha_1}{\alpha_1 + \alpha_0}$$

For instance, if we flip the coin 40 times, seeing 18 heads and 22 tails, then we can estimate that:

$$\hat\theta = P(X=1) = \frac{18}{40} = 0.45$$

And if we flip it 5 times, observing 3 heads and 2 tails, then we have:

$$\hat\theta = P(X=1) = \frac{3}{5} = 0.6$$
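Both estimates above can be reproduced in a couple of lines. Below is a minimal sketch (the helper mle_estimate is introduced here for illustration; it is not part of the notebook) that computes the fraction of heads in a list of observed flips:

import numpy as np

def mle_estimate(flips):
    "Fraction of flips that came up heads (1 = heads, 0 = tails)."
    flips = np.asarray(flips)
    return flips.mean()  # alpha_1 / (alpha_1 + alpha_0)

print(mle_estimate([1] * 18 + [0] * 22))  # 18 heads out of 40 -> 0.45
print(mle_estimate([1, 0, 1, 1, 0]))      # 3 heads out of 5  -> 0.6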

How to Calculate MLE?

The first step in calculating the maximum likelihood estimator $\hat\theta$ is to define $P(Data|\theta)$. If we flip the coin once, then $P(Data|\theta) = \theta$ if the flip results in heads, and $P(Data|\theta) = 1 - \theta$ if it turns up tails. If we observe $D = \{1,0,1,1,0\}$ by tossing the coin 5 times, and assume the flips are independent and identically distributed (i.i.d.), then we have:

$$P(Data|\theta) = \theta\cdot(1-\theta)\cdot\theta\cdot\theta\cdot(1-\theta) = \theta^3\cdot(1-\theta)^2$$

In general, if we flip the coin $n$ times, observing $\alpha_H$ heads and $\alpha_T$ tails, then

$$P(Data|\theta) = \theta^{\alpha_H}\cdot(1-\theta)^{\alpha_T}$$

The next step is to find the value of $\theta$ that maximizes $P(Data|\theta)$. It is often easier to maximize the log-likelihood instead; since the logarithm is a strictly increasing function, the maximizer is unchanged:

$$\underset{\theta}{\arg\max} \log P(Data|\theta) = \underset{\theta}{\arg\max}\ P(Data|\theta)$$

Let's define $J(\theta) = \log P(Data|\theta)$. To find the value of $\theta$ that maximizes $J(\theta)$, we take the derivative of $J(\theta)$ with respect to $\theta$, set it to zero, and solve for $\theta$.

$$\frac{\partial J(\theta)}{\partial \theta} = \frac{\partial[\alpha_H \log \theta + \alpha_T \log (1-\theta)]}{\partial \theta}= \alpha_H \frac{1}{\theta} - \alpha_T \frac{1}{1-\theta} = 0$$

Solving this for $\theta$ gives, $$\hat{\theta} = \dfrac{\alpha_H}{\alpha_H + \alpha_T}$$
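As a quick sanity check (a sketch added here for illustration, not from the original notes), we can evaluate the log-likelihood $J(\theta)$ on a dense grid of $\theta$ values and confirm that its maximizer agrees with the closed-form answer:

import numpy as np

alpha_H, alpha_T = 3, 2
thetas = np.linspace(0.001, 0.999, 999)

# log-likelihood J(theta) = alpha_H*log(theta) + alpha_T*log(1 - theta)
J = alpha_H * np.log(thetas) + alpha_T * np.log(1 - thetas)

print(thetas[np.argmax(J)])           # grid maximizer, ~0.6
print(alpha_H / (alpha_H + alpha_T))  # closed form, 0.6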

Question: How good is this MLE estimation?

If we flip the coin 5 times and observe 3 heads and 2 tails, then $\hat{\theta}_{MLE}=\frac{3}{5}=0.6$. What if we flip it 65 times and observe 30 heads and 35 tails? Then $\hat{\theta}_{MLE}=\frac{30}{65}\approx 0.46$.

Which estimate should we trust more? Let's assume that the coin is a government-minted coin, meaning it is (approximately) fair, with "close" to a 50-50 chance of heads/tails. This prior knowledge tells us that $\theta$ is most likely about $0.5$, so we cannot quite rely on the estimate $\hat{\theta} = 0.6$ obtained from only 5 flips.

In general, MLE works well when we have plenty of data, but with only a few observations, such as 5 coin flips, the estimate is unreliable. This leads us to a second principle for estimating parameters, one that lets us combine our prior assumptions with the observed data to form the final estimate.

Maximum a Posteriori Estimation (MAP)

Returning to our coin flip example, we assume that the coin is a government-minted coin, meaning $\theta$ is close to $0.5$. What can we do now that we have this prior knowledge? How should we estimate the probability of heads?

We have prior knowledge about the coin, but at the beginning we do not have enough flips to estimate the probability of heads ($\theta$) reliably. However, we can add a number of imaginary coin flips. For instance, if we add 10 imaginary heads and 10 imaginary tails before the first real flip, then $\hat{\theta}=\frac{10}{20}=0.5$, which encodes our prior knowledge about the coin. After the first flip, if it turns up heads, $\hat{\theta}=\frac{1+10}{1+10+10}\approx 0.52$.

As the number of real coin flips increases, the estimate improves; more importantly, even with only a few flips the estimate stays reasonable. The more confident we are in our prior assumptions, the larger the number of imaginary flips we can add. Thus we have:

$$\hat{\theta} = \dfrac{\alpha_H + \lambda_H}{(\alpha_H+\lambda_H) + (\alpha_T+\lambda_T)}$$

where $\lambda_H$ and $\lambda_T$ are the numbers of imaginary (or virtual) heads and tails, respectively.
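A small sketch of this smoothed estimator (the function name smoothed_estimate is my own, added for illustration) reproduces the numbers above:

def smoothed_estimate(alpha_H, alpha_T, lambda_H=10, lambda_T=10):
    "Estimate after adding lambda_H imaginary heads and lambda_T imaginary tails."
    return (alpha_H + lambda_H) / (alpha_H + lambda_H + alpha_T + lambda_T)

print(smoothed_estimate(0, 0))  # before any real flip: 10/20 = 0.5
print(smoothed_estimate(1, 0))  # after one head:       11/21 ~ 0.52
print(smoothed_estimate(3, 2))  # 3 heads, 2 tails:     13/25 = 0.52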

Bayesian Approach

In the Bayesian approach, rather than estimating a single $\theta$, we obtain a distribution over possible values of $\theta$. We then choose the value of $\theta$ that is most probable given the observed data and our prior belief.

We need Bayes rule to proceed.

Chain rule: $$P(X,Y)=P(X|Y)P(Y)=P(Y|X)P(X)$$

Bayes rule: $$P(X|Y)=\frac{P(Y|X)P(X)}{P(Y)}$$

Using Bayes rule, we have: $$P(\theta|Data)=\frac{P(Data|\theta)P(\theta)}{P(Data)}$$

Or equivalently, $$P(\theta|Data)\propto P(Data|\theta)P(\theta)$$

We can drop $P(Data)$ because it does not depend on the parameter $\theta$.

$P(\theta|Data)$ is called the "posterior", $P(Data|\theta)$ is called the "likelihood", and $P(\theta)$ is called the "prior".
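To make Bayes rule concrete, here is a minimal sketch (added for illustration; it assumes a uniform prior over a grid of $\theta$ values) that computes the posterior numerically for 3 heads and 2 tails:

import numpy as np

alpha_H, alpha_T = 3, 2
thetas = np.linspace(0, 1, 101)

likelihood = thetas ** alpha_H * (1 - thetas) ** alpha_T  # P(Data|theta)
prior = np.ones_like(thetas) / len(thetas)                # uniform P(theta)
posterior = likelihood * prior
posterior /= posterior.sum()                              # normalize: P(theta|Data)

print(round(thetas[np.argmax(posterior)], 2))  # with a flat prior, the mode matches the MLE: 0.6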

MLE vs. MAP

1- Maximum Likelihood estimation (MLE):

  • Choose value of $\theta$ that maximizes the probability of observed data. $$\hat{\theta}_{MLE}=\underset{\theta}{\arg\max}\ P(Data|\theta)$$

2- Maximum a posteriori (MAP) estimation:

  • Choose value of $\theta$ that is most probable given observed data and prior belief. $$ \begin{aligned} \hat{\theta}_{MAP}&=\underset{\theta}{\arg\max}\ P(\theta|Data)\\ &=\underset{\theta}{\arg\max}\ P(Data|\theta) P(\theta) \end{aligned} $$

MAP Estimation for the Binomial Distribution

The likelihood is Binomial: $P(Data|\theta)={n\choose \alpha_H}\theta^{\alpha_H}(1-\theta)^{\alpha_T}$, where $n = \alpha_H + \alpha_T$.

If we assume the prior is a Beta distribution: $P(\theta)=\frac{\theta^{\beta_H-1}(1-\theta)^{\beta_T-1}}{B(\beta_H,\beta_T)}$, i.e. $\theta \sim Beta(\beta_H, \beta_T)$, where

$B(x,y)=\int_0^1 t^{x-1}(1-t)^{y-1}\,dt$ is the Beta function.

Then the posterior is also a Beta distribution: $\theta|Data \sim Beta(\beta_H+\alpha_H, \beta_T+\alpha_T)$

And,

$$ \begin{aligned} \hat{\theta}_{MAP}&=\underset{\theta}{\arg\max}\ P(Data|\theta) P(\theta)\\ &=\frac{\alpha_H+\beta_H -1}{\alpha_H+\beta_H+\alpha_T+\beta_T -2} \end{aligned} $$
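This closed form follows by the same recipe as in the MLE case: take the log of $P(Data|\theta)P(\theta)$ (the binomial coefficient and $B(\beta_H,\beta_T)$ do not depend on $\theta$ and drop out), differentiate, and set to zero:

$$\frac{\partial}{\partial \theta}\Big[(\alpha_H+\beta_H-1)\log\theta + (\alpha_T+\beta_T-1)\log(1-\theta)\Big] = \frac{\alpha_H+\beta_H-1}{\theta} - \frac{\alpha_T+\beta_T-1}{1-\theta} = 0$$

Solving for $\theta$ gives the expression above.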
  • Conjugate prior: $P(\theta)$ is a conjugate prior for the likelihood function $P(Data|\theta)$ if the prior $P(\theta)$ and the posterior $P(\theta|Data)$ have the same form.

  • The Beta prior is equivalent to extra coin flips: comparing with the imaginary-flip estimator above, it adds $\beta_H - 1$ imaginary heads and $\beta_T - 1$ imaginary tails.

  • As the number of samples (e.g. coin flips) increases, the effect of the prior is "washed out"; as $N\rightarrow \infty$, the prior is "forgotten".

  • For small sample sizes, the prior is important.
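The following sketch (added for illustration; the Beta(11, 11) prior, i.e. 10 imaginary heads and 10 imaginary tails, is just an example choice) compares $\hat{\theta}_{MLE}$ with $\hat{\theta}_{MAP}$ and shows the prior being washed out as the data grows:

def mle(alpha_H, alpha_T):
    return alpha_H / (alpha_H + alpha_T)

def map_estimate(alpha_H, alpha_T, beta_H=11, beta_T=11):
    # Mode of the Beta(alpha_H + beta_H, alpha_T + beta_T) posterior
    return (alpha_H + beta_H - 1) / (alpha_H + beta_H + alpha_T + beta_T - 2)

for alpha_H, alpha_T in [(3, 2), (30, 20), (3000, 2000)]:
    print(alpha_H, alpha_T, mle(alpha_H, alpha_T), round(map_estimate(alpha_H, alpha_T), 3))
# MLE stays at 0.6, while MAP moves from 0.52 toward 0.6 as more flips are observed.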

Implemented Functions

flip_coin[source]

flip_coin(num_of_experiments=1000, num_of_flips=30)

Flip the coin num_of_flips times and repeat this experiment num_of_experiments times. Return, for each possible number of heads, the count of experiments in which that many heads were observed.
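The [source] implementation is not reproduced here; below is a minimal sketch of what flip_coin might look like, assuming a fair coin and numpy's binomial sampler (an assumption made for this sketch, not necessarily the notebook's actual code):

import numpy as np

def flip_coin(num_of_experiments=1000, num_of_flips=30):
    "Return, for each head count 0..num_of_flips, how many experiments produced it."
    heads_per_experiment = np.random.binomial(n=num_of_flips, p=0.5, size=num_of_experiments)
    return np.bincount(heads_per_experiment, minlength=num_of_flips + 1)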

# initialize the variables
num_of_flips = 40
num_of_experiments = 3000
head_counts = flip_coin(num_of_experiments,num_of_flips)

Let's plot a chart and see the distribution of the number of heads in 40 flips across all the experiments.

import altair as alt
import pandas as pd

x = range(num_of_flips + 1)
source = pd.DataFrame({
    'Number of Heads': x,
    'Number of Ways': head_counts
})
# Bar chart of how many experiments produced each head count
bar_chart = alt.Chart(source).mark_bar().encode(
    x='Number of Heads',
    y='Number of Ways',
).properties(title='Distribution of Heads', width=360)
bar_chart

What does the plot look like? Does the bell shape ring a bell?

Now we plot a Normal distribution with mean = num_of_flips / 2 and standard deviation = $\sqrt{mean/2}$ (the mean and standard deviation of a Binomial distribution with $p = 0.5$), and see how it compares.

import numpy as np
from scipy.stats import norm

x = np.arange(0, num_of_flips)
mean = num_of_flips / 2
stddev = np.sqrt(mean / 2)   # Binomial(n, 0.5): variance n/4 = mean/2
y = norm.pdf(x, mean, stddev)

data = pd.DataFrame({
    'x': x,
    'y': y
})
normal_chart = alt.Chart(data).mark_line(color='red').encode(
    x='x',
    y='y'
).properties(title='Normal Distribution', width=360)
normal_chart | bar_chart

By comparing the charts we realize that

  • As the sample size becomes larger, the distribution of head counts approximates a normal distribution.

  • As we flip the coin repeatedly, the numbers of heads and tails become roughly equal: the head count peaks around 20 out of 40 flips, which indicates that the probability of heads is close to $0.5$.
