In statistics there are many ways to define good estimators for quantities such as the mean or variance of a model. However, instead of relying on authority and tradition to provide such estimators, it would be useful to have a principled way of deriving them.
Hence, we are going to have a look at maximum likelihood estimation, a very commonly used principle for deriving the sought-after estimators.
Let’s start by examining $m$ independent samples $\mathbb{X} = \{\x^{(1)}, \dots, \x^{(m)}\}$ drawn from a data-generating distribution $\pdata$, which is itself unknown. We can then write down the model distribution as:
$$\begin{equation} p_{\text{model}}: (\x;\THETA) \mapsto \hat{p}\in\mathbb{R} \;\hat{=}\; \pdata(\x) \end{equation}$$where $\THETA$ parametrizes the family of probability distributions $\pmodel{\x}$: each value of $\THETA$ picks out one member of the family, which maps a configuration $\x$ to a real number estimating the true probability $\pdata(\x)$.
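To make this concrete, consider a standard illustrative instance (a univariate Gaussian family; the example is mine, not from the text): with $\THETA = (\mu, \sigma)$ we have
$$\begin{equation} \pmodel{x} = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \end{equation}$$and each choice of $\THETA$ selects one candidate distribution from the family. The maximum likelihood estimator for $\THETA$ is then defined as: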
$$\begin{align} \THETA_{\text{ML}} &= \argmax_{\THETA} \pmodel{\mathbb{X}} \\ &= \argmax_{\THETA} \prod_{i=1}^{m} \pmodel{\x^{(i)}} \end{align}$$Here the independence of the samples lets the joint likelihood factorize into a product over the individual samples. Note that working with the product $\prod$ is rather inconvenient for a variety of reasons; in particular, a product of many probabilities quickly underflows in floating-point arithmetic (see the sketch after the next equation). Hence, let’s transform it into a sum $\sum$ by taking the logarithm:
$$\begin{equation} \THETA_{\text{ML}} = \argmax_{\THETA} \sum_{i=1}^{m} \log\pmodel{\x^{(i)}} \end{equation}$$This works because the logarithm is strictly increasing, so applying it does not change the $\argmax$ over $\THETA$.
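As a quick aside on the underflow issue mentioned above, here is a minimal Python sketch (my own illustration, not from the book) that multiplies a few hundred small per-sample likelihoods and compares the result to the corresponding sum of logs:

```python
import numpy as np

# 500 hypothetical per-sample likelihoods, each a small probability.
rng = np.random.default_rng(0)
likelihoods = rng.uniform(0.01, 0.1, size=500)

product = np.prod(likelihoods)         # underflows to 0.0 in float64
log_sum = np.sum(np.log(likelihoods))  # stays well-behaved

print(product)  # 0.0
print(log_sum)  # a finite negative number (on the order of -1500)
```

We can continue transforming by dividing by $m$, which also has no effect on the $\argmax$ for $\THETA$: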
$$\begin{equation} \THETA_{\text{ML}} = \argmax_{\THETA} \frac{1}{m} \sum_{i=1}^{m} \log\pmodel{\x^{(i)}} \end{equation}$$which we can then express as an expectation with respect to the empirical distribution $\hat{p}_\text{data}$:
$$\begin{equation} \THETA_{\text{ML}} = \argmax_{\THETA} \mathbb{E}_{{\x} \sim {\hat{p}_\text{data}}} \log\pmodel{\x} \end{equation}$$In summary: by maximizing the expected log-probability of our model distribution $p_\text{model}$ under the empirical distribution ${\x} \sim {\hat{p}_\text{data}}$, with respect to the parameter $\THETA$, we obtain a principled estimator for $\THETA$.
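To close the loop, here is a minimal sketch of the recipe in Python (an illustration under assumed settings: a Gaussian model with known $\sigma$, estimated by an ad-hoc grid search; all variable names are mine):

```python
import numpy as np

# Maximum likelihood estimation of the mean of a Gaussian with known
# standard deviation, via a simple grid search over candidate parameters.
rng = np.random.default_rng(42)
true_mu, sigma = 3.0, 1.0
x = rng.normal(true_mu, sigma, size=1000)  # samples standing in for p_data

def avg_log_likelihood(mu):
    # (1/m) * sum_i log p_model(x_i; mu) for the Gaussian model
    return np.mean(-0.5 * np.log(2 * np.pi * sigma**2)
                   - (x - mu) ** 2 / (2 * sigma**2))

candidates = np.linspace(0.0, 6.0, 601)  # grid of candidate THETA values
mu_ml = candidates[np.argmax([avg_log_likelihood(mu) for mu in candidates])]

print(mu_ml)     # grid-search ML estimate, close to ...
print(x.mean())  # ... the sample mean, the closed-form ML estimate here
```

Up to the grid resolution, the search recovers the sample mean, which for this family is exactly the closed-form maximum likelihood estimate.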
This presentation largely follows Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. Cambridge, MA: MIT Press. p. 128.