In statistics there are many ways to define good estimators for a model, such as the mean or variance. However, instead of relying on authority and tradition to provide such estimators, it would be useful to have a principled approach for deriving them.
Hence, we are going to have a look at maximum likelihood estimation, a very commonly used principle for deriving the sought-after estimators.
Let’s start by examining $m$ independent samples $\mathbb{X} = \{x^{(1)}, \ldots, x^{(m)}\}$ drawn from a distribution $p_{\text{data}}(x)$, where the latter is not actually known. Then, we can write down the model distribution as:
$$p_{\text{model}}\colon (x; \theta) \mapsto \hat{p} \in \mathbb{R}, \quad \hat{p} \approx p_{\text{data}}(x)$$

where $\theta$ is a parameter over the family of probability distributions $p_{\text{model}}(x; \theta)$. Then the maximum likelihood estimator for $\theta$ is defined as:
$$\theta_{\text{ML}} = \underset{\theta}{\arg\max}\; p_{\text{model}}(\mathbb{X}; \theta) = \underset{\theta}{\arg\max} \prod_{i=1}^{m} p_{\text{model}}(x^{(i)}; \theta)$$

Please note that working with the product $\prod$ is rather inconvenient for a variety of reasons; for instance, a product of many small probabilities is prone to numerical underflow. Hence, let’s transform it into a sum $\sum$ by taking the logarithm:
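To see why the product form is numerically inconvenient, consider a small sketch (a hypothetical example, not from the source text): the product of a few thousand densities underflows double precision to zero, while the sum of their logarithms stays perfectly representable.

```python
import math
import random

random.seed(0)

# Standard normal density, playing the role of p_model(x; theta) here.
def normal_pdf(x, mu=0.0, sigma=1.0):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

samples = [random.gauss(0.0, 1.0) for _ in range(2000)]

# The direct product of m densities underflows to 0.0 in double precision ...
product = 1.0
for x in samples:
    product *= normal_pdf(x)

# ... while the equivalent sum of log-densities remains a finite number.
log_sum = sum(math.log(normal_pdf(x)) for x in samples)

print(product)   # 0.0 (underflow)
print(log_sum)   # a large but finite negative number
```

This is one of the practical reasons the derivation below switches to log-likelihoods.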
$$\theta_{\text{ML}} = \underset{\theta}{\arg\max} \sum_{i=1}^{m} \log p_{\text{model}}(x^{(i)}; \theta)$$

This works because the logarithm is strictly increasing, so applying it does not change the $\arg\max$ over $\theta$. We can continue transforming by dividing by $m$, which likewise has no effect on the $\arg\max$:
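The argmax-invariance claim can be checked directly on a toy likelihood (a made-up function for illustration, not anything from the source): any strictly increasing transform, such as the logarithm, picks out the same maximizer.

```python
import math

# A toy "likelihood" in theta, peaked at theta = 1.5.
def likelihood(theta):
    return math.exp(-(theta - 1.5) ** 2)

candidates = [0.5, 1.0, 1.5, 2.0, 2.5]

argmax_f = max(candidates, key=likelihood)
argmax_log_f = max(candidates, key=lambda t: math.log(likelihood(t)))

print(argmax_f, argmax_log_f)  # both are 1.5
```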
$$\theta_{\text{ML}} = \underset{\theta}{\arg\max} \frac{1}{m} \sum_{i=1}^{m} \log p_{\text{model}}(x^{(i)}; \theta)$$

which we can then express as an expectation with respect to the empirical distribution $\hat{p}_{\text{data}}$:
$$\theta_{\text{ML}} = \underset{\theta}{\arg\max}\; \mathbb{E}_{x \sim \hat{p}_{\text{data}}} \log p_{\text{model}}(x; \theta)$$

Summarized: by maximizing the expectation $\mathbb{E}$ of the logarithm of our model distribution $p_{\text{model}}$ over the given samples $x \sim \hat{p}_{\text{data}}$, with respect to the parameter $\theta$, we tend to end up with good estimators.
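As a minimal end-to-end sketch (my own illustration, assuming a unit-variance Gaussian model where only the mean $\theta$ is unknown), we can maximize the average log-likelihood over a grid of candidate $\theta$ values and observe that the maximizer coincides with the sample mean, which is the well-known closed-form MLE in this case:

```python
import math
import random

random.seed(1)

# Draw samples from a Gaussian with true mean 3.0 and sigma 1.0.
samples = [random.gauss(3.0, 1.0) for _ in range(500)]

def avg_log_likelihood(theta):
    # (1/m) * sum_i log p_model(x^(i); theta) for a unit-variance Gaussian
    m = len(samples)
    return sum(-0.5 * (x - theta) ** 2 - 0.5 * math.log(2 * math.pi)
               for x in samples) / m

# Coarse grid search over candidate theta values in [0, 6].
grid = [i / 100.0 for i in range(0, 601)]
theta_ml = max(grid, key=avg_log_likelihood)

sample_mean = sum(samples) / len(samples)
print(theta_ml, sample_mean)  # agree to within the grid spacing of 0.01
```

A grid search is of course only viable for a one-dimensional $\theta$; in practice the same objective is maximized with gradient-based optimization.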
This presentation largely follows Goodfellow et al. (2016). Deep Learning. Cambridge, MA: MIT Press, p. 128.