
Friday, November 30, 2018

Maximum Likelihood Estimation

\newcommand{\THETA}{ \boldsymbol\theta } \newcommand{\x}{ \boldsymbol{x} }
\newcommand{\pdata}{ p_{\text{data}}(\x) } \newcommand{\pmodel}[2][;\THETA]{ p_{\text{model}}(#2#1) }
\DeclareMathOperator*{\argmax}{ arg\,max }

In statistics there are many ways to define good estimators for quantities of a model, such as its mean or variance. However, instead of relying on authority and tradition to provide such estimators, it would be useful to have a principled approach for deriving them.

Hence, we are going to have a look at maximum likelihood estimation, which is a very commonly used principle for deriving the sought-after estimators.

Let’s start by examining m independent samples \mathbb{X} = \{\x^{(1)}, \ldots, \x^{(m)}\} drawn from a distribution \pdata, which itself is unknown. We can then write down the model distribution as:

\begin{equation} p_{\text{model}}: (\x;\THETA) \mapsto \hat{p}\in\mathbb{R} \;\hat{=}\; \pdata \end{equation}

where \THETA parametrizes the family of probability distributions \pmodel{\x}, so that each choice of \THETA maps an input \x to an estimate of the true probability \pdata. The maximum likelihood estimator for \THETA is then defined as:

\begin{align} \THETA_{\text{ML}} &= \argmax_{\THETA} \pmodel{\mathbb{X}} \\ &= \argmax_{\THETA} \prod_{i=1}^{m} \pmodel{\x^{(i)}} \end{align}
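To make this concrete, here is a small worked example (the Gaussian model is purely illustrative and not part of the general argument): suppose each sample is a scalar x^{(i)} and the model is a Gaussian with unknown mean \theta and variance fixed to 1. The product above then becomes:

\begin{equation} \theta_{\text{ML}} = \argmax_{\theta} \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}} \exp\!\left( -\frac{\left( x^{(i)} - \theta \right)^2}{2} \right) \end{equation}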

Please note that working with the product \prod is rather inconvenient in practice: multiplying many probabilities smaller than one quickly leads to numerical underflow, and products are cumbersome to differentiate. Hence, let’s transform it into a sum \sum by taking the logarithm:

\begin{equation} \THETA_{\text{ML}} = \argmax_{\THETA} \sum_{i=1}^{m} \log\pmodel{\x^{(i)}} \end{equation}
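Continuing the illustrative Gaussian example from above, the logarithm turns the product into a sum of quadratic terms:

\begin{align} \theta_{\text{ML}} &= \argmax_{\theta} \sum_{i=1}^{m} \left( -\tfrac{1}{2}\log(2\pi) - \tfrac{1}{2}\left( x^{(i)} - \theta \right)^2 \right) \\ &= \argmax_{\theta} \left( -\tfrac{1}{2} \sum_{i=1}^{m} \left( x^{(i)} - \theta \right)^2 \right) \end{align}

Setting the derivative with respect to \theta to zero gives \theta_{\text{ML}} = \frac{1}{m}\sum_{i=1}^{m} x^{(i)}, i.e. the familiar sample mean.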

This works because the logarithm is a strictly increasing transformation and therefore does not change the \argmax over \THETA. We can continue transforming by dividing by m, which likewise has no effect on the \argmax over \THETA:

\begin{equation} \THETA_{\text{ML}} = \argmax_{\THETA} \frac{1}{m} \sum_{i=1}^{m} \log\pmodel{\x^{(i)}} \end{equation}

which we can then express as an expectation with respect to the empirical distribution \hat{p}_\text{data}:

\begin{equation} \THETA_{\text{ML}} = \argmax_{\THETA} \mathbb{E}_{{\x} \sim {\hat{p}_\text{data}}} \log\pmodel{\x} \end{equation}

To summarize: by maximizing the expectation \mathbb{E} of the logarithm of our model distribution p_\text{model} over samples {\x} \sim {\hat{p}_\text{data}}, with respect to the parameter \THETA, we obtain the maximum likelihood estimate, which under suitable regularity conditions enjoys desirable properties such as consistency.
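To see this numerically, here is a minimal sketch (assuming NumPy and SciPy are available; the Gaussian model with fixed unit variance and all names are chosen purely for illustration). It maximizes the average log-likelihood over \theta and compares the result to the sample mean:

import numpy as np
from scipy.optimize import minimize_scalar

# Draw m samples from the "unknown" data distribution p_data.
rng = np.random.default_rng(0)
m = 1000
samples = rng.normal(loc=2.5, scale=1.0, size=m)

def avg_neg_log_likelihood(theta):
    # Average negative log-likelihood of a Gaussian with mean theta and variance 1.
    return -np.mean(-0.5 * np.log(2 * np.pi) - 0.5 * (samples - theta) ** 2)

# Maximizing the average log-likelihood is the same as minimizing its negative.
result = minimize_scalar(avg_neg_log_likelihood)

print("theta_ML (numerical):", result.x)
print("sample mean         :", samples.mean())

Both printed values should agree up to the optimizer's tolerance, matching the closed-form result derived in the Gaussian example above.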

1 comment:

  1. This presentation largely follows Goodfellow, Bengio, and Courville (2016). Deep Learning. Cambridge, MA: MIT Press. p. 128.
