Friday, November 30, 2018

Maximum Likelihood Estimation

$$ \newcommand{\THETA}{ \boldsymbol\theta } \newcommand{\x}{ \boldsymbol{x} } $$
$$ \newcommand{\pdata}{ p_{\text{data}}(\x) } \newcommand{\pmodel}[2][;\THETA]{ p_{\text{model}}(#2#1) } $$
$$ \DeclareMathOperator*{\argmax}{ arg\,max } $$

In statistics there are many ways to define good estimators for a model, such as the sample mean or variance. However, instead of relying on authority and tradition to provide such estimators, it would be useful to have a principled approach for deriving them.

Hence, we are going to have a look at maximum likelihood estimation, a very commonly used principle for deriving the sought-after estimators.

Let’s start by examining $m$ independent samples $\mathbb{X} = \{\x^{(1)}, \dots, \x^{(m)}\}$ drawn from the data-generating distribution $\pdata$, which itself is not known. We can then write down the model distribution as:

$$\begin{equation} p_{\text{model}}: (\x;\THETA) \mapsto \hat{p}\in\mathbb{R} \;\hat{=}\; \pdata \end{equation}$$

where $\THETA$ parametrizes the family of probability distributions $\pmodel{\x}$, so that each value of $\THETA$ selects one candidate distribution. The maximum likelihood estimator for $\THETA$ is then defined as:

$$\begin{align} \THETA_{\text{ML}} &= \argmax_{\THETA} \pmodel{\mathbb{X}} \\ &= \argmax_{\THETA} \prod_{i=1}^{m} \pmodel{\x^{(i)}} \end{align}$$
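To make the definition concrete, here is a small worked example that is not part of the original presentation: suppose each sample $\x^{(i)}$ is a single coin flip taking the value $1$ (heads) or $0$ (tails), modeled by a Bernoulli distribution with a single parameter $\THETA\in[0,1]$, and suppose that $k$ of the $m$ flips came up heads. The product above then collapses to

$$\begin{equation} \pmodel{\mathbb{X}} = \prod_{i=1}^{m} \THETA^{\x^{(i)}}\,(1-\THETA)^{1-\x^{(i)}} = \THETA^{k}\,(1-\THETA)^{m-k} \end{equation}$$

and $\THETA_{\text{ML}}$ is whichever value of $\THETA$ maximizes this expression; we will return to this example once the objective has been simplified.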

Note that working with the product $\prod$ in this expression is rather inconvenient for a variety of reasons. Hence, let’s transform it into a sum $\sum$ by taking the logarithm:

$$\begin{equation} \THETA_{\text{ML}} = \argmax_{\THETA} \sum_{i=1}^{m} \log\pmodel{\x^{(i)}} \end{equation}$$
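One of those reasons is purely numerical: a product of many probabilities (or densities) smaller than one quickly underflows to zero in floating-point arithmetic, whereas the corresponding sum of logarithms remains perfectly representable. Here is a minimal Python sketch of this effect; the standard Gaussian model and the sample size are arbitrary illustrative choices, not part of the original text:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 10_000
x = rng.normal(size=m)  # m samples; the model below happens to match the data distribution

# Per-sample densities p_model(x^(i); theta) for a standard Gaussian model (mu=0, sigma=1).
densities = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

print(np.prod(densities))         # underflows to 0.0 long before all m factors are multiplied
print(np.sum(np.log(densities)))  # finite; roughly -1.42 per sample here
```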

Replacing the product by this sum of logarithms is legitimate because the logarithm is a monotonically increasing function, so it does not change the $\argmax$ over $\THETA$. We can continue transforming by dividing by $m$, which, being a rescaling by a positive constant, also has no effect on the $\argmax$ over $\THETA$:

$$\begin{equation} \THETA_{\text{ML}} = \argmax_{\THETA} \frac{1}{m} \sum_{i=1}^{m} \log\pmodel{\x^{(i)}} \end{equation}$$
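Continuing the illustrative Bernoulli example from above: its average log-likelihood is $\frac{k}{m}\log\THETA + \frac{m-k}{m}\log(1-\THETA)$, and setting the derivative with respect to $\THETA$ to zero gives

$$\begin{equation} \frac{k}{m\THETA} - \frac{m-k}{m(1-\THETA)} = 0 \quad\Longrightarrow\quad \THETA_{\text{ML}} = \frac{k}{m} \end{equation}$$

so in this simple case the maximum likelihood principle recovers exactly the familiar sample mean mentioned in the introduction, namely the observed fraction of heads.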

Returning to the general case, the average over the samples can equally be written as an expectation with respect to the empirical distribution $\hat{p}_\text{data}$ defined by the training data:

$$\begin{equation} \THETA_{\text{ML}} = \argmax_{\THETA} \mathbb{E}_{{\x} \sim {\hat{p}_\text{data}}} \log\pmodel{\x} \end{equation}$$

To summarize: by maximizing the expectation $\mathbb{E}$ of the logarithm of our model distribution $p_\text{model}$ over the given samples ${\x} \sim {\hat{p}_\text{data}}$ with respect to the parameter $\THETA$, we obtain a principled way to arrive at good estimators.
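As an illustration of the whole recipe, here is a minimal sketch that is not part of the original presentation; the Gaussian model, the use of scipy.optimize.minimize, and all variable names are my own choices. We draw samples, write down the negative average log-likelihood of a Gaussian model as a function of its parameters, and minimize it numerically (which is the same as maximizing the average log-likelihood). For a Gaussian, the result should come out close to the sample mean and the (biased) sample standard deviation, which are the closed-form maximum likelihood estimates.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(42)
samples = rng.normal(loc=2.0, scale=0.5, size=1_000)  # stand-in for draws from p_data

def negative_avg_log_likelihood(theta, x):
    """Negative of (1/m) * sum_i log p_model(x^(i); theta) for a Gaussian model."""
    mu, log_sigma = theta        # optimize log(sigma) so that sigma stays positive
    sigma = np.exp(log_sigma)
    log_p = -0.5 * np.log(2 * np.pi) - log_sigma - 0.5 * ((x - mu) / sigma) ** 2
    return -np.mean(log_p)

result = minimize(negative_avg_log_likelihood, x0=np.array([0.0, 0.0]), args=(samples,))
mu_ml, sigma_ml = result.x[0], np.exp(result.x[1])

print(mu_ml, sigma_ml)                # numerical maximum likelihood estimates
print(samples.mean(), samples.std())  # closed-form ML estimates for a Gaussian
```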


1. This presentation largely follows Goodfellow, Bengio, and Courville (2016). Deep Learning. MIT Press. p. 128.
