Many machine learning resources are vague when giving definitions. This is because the authors don't know these concepts well enough.
So, what is cross-entropy?
Start from the KL divergence (discrete form):
$KL(p||q)=\sum_{k=1}^K p_k \log \frac{p_k}{q_k}$
$KL(p||q)=\sum_k p_k \log p_k -\sum_k p_k \log q_k=-H(p)+ H(p,q)$
Here, $H(p)$ is the entropy of distribution $p$, and $H(p,q)=-\sum_k p_k \log q_k$ is the cross-entropy between distributions $p$ and $q$. Notice that cross-entropy, like KL divergence, is asymmetric.
According to Cover and Thomas (2006), cross-entropy is the average number of bits needed to encode data coming from a source with distribution $p$ when we use model $q$ to define our codebook.
Hence the "regular" entropy is $H(p)=H(p,p)$.
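As a quick numerical check of the identities above, here is a minimal sketch in plain Python. The two distributions `p` and `q` are made-up examples, and base-2 logarithms are assumed so that the results are in bits.

```python
import math

def entropy(p):
    """H(p) = -sum_k p_k log2(p_k), in bits; terms with p_k = 0 contribute 0."""
    return -sum(pk * math.log2(pk) for pk in p if pk > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum_k p_k log2(q_k), in bits; assumes q_k > 0 wherever p_k > 0."""
    return -sum(pk * math.log2(qk) for pk, qk in zip(p, q) if pk > 0)

def kl_divergence(p, q):
    """KL(p || q) = sum_k p_k log2(p_k / q_k), in bits."""
    return sum(pk * math.log2(pk / qk) for pk, qk in zip(p, q) if pk > 0)

# Made-up example distributions over K = 2 outcomes.
p = [0.5, 0.5]
q = [0.25, 0.75]

print("H(p)          =", entropy(p))
print("H(p, q)       =", cross_entropy(p, q))
print("H(q, p)       =", cross_entropy(q, p))  # differs from H(p, q): asymmetric
print("KL(p || q)    =", kl_divergence(p, q))
print("H(p,q) - H(p) =", cross_entropy(p, q) - entropy(p))  # matches KL up to rounding
print("H(p, p)       =", cross_entropy(p, p))  # equals H(p)
```

Swapping the arguments changes both the cross-entropy and the KL divergence, while `cross_entropy(p, p)` collapses to `entropy(p)`.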
So what is a cross-entropy loss function, then?
Be patient; first look at where maximum log-likelihood comes from.
You may have seen the derivation of MLE (Maximum Likelihood Estimation) several times. You assume:
- The data are i.i.d. (non-i.i.d. data has received research attention in recent years, but it is outside the scope of this article)
- $X=\{x^{(1)},..., x^{(m)}\}$
- $p_{model}(x;\theta)$ is a parametric family of probability distributions over the data
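$\theta_{ML}=\arg\max_{\theta} p_{model}(X; \theta)=\arg\max_{\theta} \prod_{i=1}^m p_{model}(x^{(i)}; \theta)$
Taking the log does not change the argmax, because the logarithm is monotonically increasing, and it turns the product into a sum:
$\theta_{ML}=\arg\max_{\theta} \sum_{i=1}^m \log p_{model} (x^{(i)}; \theta)$
Dividing by $m$ does not change the argmax either, and the sum becomes an expectation under the empirical distribution $\tilde{p}_{data}$:
$\theta_{ML}=\arg\max_{\theta} E_{x \sim \tilde{p}_{data}} \log p_{model} (x; \theta)$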
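To make this last step concrete, the sketch below checks that the per-sample average of the log-likelihood is the same number as an expectation under the empirical distribution. The sample list and the categorical model `q` are invented for illustration; natural logarithms are used.

```python
import math
from collections import Counter

# Hypothetical i.i.d. samples from a 3-outcome source (values are made up).
samples = ["a", "b", "a", "c", "a", "b", "a", "a"]
m = len(samples)

# Hypothetical categorical model p_model(x; theta): theta is the table itself.
q = {"a": 0.5, "b": 0.3, "c": 0.2}

# (1/m) * sum_i log p_model(x^(i); theta)
avg_log_lik = sum(math.log(q[x]) for x in samples) / m

# Empirical distribution: puts mass count(x)/m on each observed value.
p_tilde = {x: c / m for x, c in Counter(samples).items()}

# E_{x ~ p_tilde} [ log p_model(x; theta) ]
expected_log_lik = sum(p_tilde[x] * math.log(q[x]) for x in p_tilde)

print(avg_log_lik, expected_log_lik)  # equal up to floating-point rounding

# For a categorical model the maximizer is the empirical frequency table,
# so p_tilde itself should score at least as high as q on this data.
best = sum(p_tilde[x] * math.log(p_tilde[x]) for x in p_tilde)
print(best >= expected_log_lik)
```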
Notice:
$KL(\tilde{p}_{data} || p_{model})=E_{x \sim \tilde{p}_{data}} [\log \tilde{p}_{data}(x) - \log p_{model}(x) ]$
The first term does not depend on the model, so minimizing $KL(\tilde{p}_{data} || p_{model})$ w.r.t. the model parameters is equivalent to minimizing the cross-entropy term:
$-E_{x \sim \tilde{p}_{data}} [\log p_{model} (x)]=H(\tilde{p}_{data}, p_{model})$
This is also equivalent to the MLE above. In fact, any loss consisting of a negative log-likelihood is a cross-entropy between the empirical distribution and the probability distribution defined by the model. For example, mean squared error is, up to constants that do not depend on the parameters, the cross-entropy between the empirical distribution and a Gaussian model with fixed variance. Using the term "cross-entropy" only for the negative log-likelihood (NLL) of a Bernoulli (logistic regression) or softmax distribution is a misnomer, because cross-entropy is in fact used in machine learning wherever there is maximum likelihood.
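The mean squared error claim can be checked numerically. The sketch below assumes a Gaussian model with a fixed standard deviation and a single unknown mean; the data values, `sigma`, and the candidate grid are all made up for illustration.

```python
import math

# Made-up observations and a fixed, assumed standard deviation.
data = [1.2, 0.7, 1.9, 1.4, 0.8]
sigma = 0.5

def gaussian_nll(mu):
    """Average negative log-likelihood of `data` under N(mu, sigma^2)."""
    const = 0.5 * math.log(2 * math.pi * sigma ** 2)
    return sum(const + (x - mu) ** 2 / (2 * sigma ** 2) for x in data) / len(data)

def mse(mu):
    """Mean squared error of predicting every point with the constant mu."""
    return sum((x - mu) ** 2 for x in data) / len(data)

# Candidate means on a coarse grid (an illustrative search, not an optimizer).
candidates = [i / 100 for i in range(0, 301)]
best_by_nll = min(candidates, key=gaussian_nll)
best_by_mse = min(candidates, key=mse)

print(best_by_nll, best_by_mse)  # the same candidate wins under both criteria

# NLL = MSE / (2 sigma^2) + constant, so the two differ only by an affine map.
mu = 1.0
const = 0.5 * math.log(2 * math.pi * sigma ** 2)
print(gaussian_nll(mu), mse(mu) / (2 * sigma ** 2) + const)
```

Because the two objectives differ only by a positive scale factor and an additive constant, they share the same minimizer, which is why minimizing MSE is maximum likelihood under this Gaussian assumption.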
--------------------------------
2019.5.18 Note:
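For many discriminative models, the formulas above aren't quite accurate, because the model defines a conditional distribution $p_{model}(y|x)$ rather than a distribution over $x$. Concretely: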
- $(X, Y)=\{(x^{(i)}, y^{(i)})\}_{i=1}^n$, with empirical distribution $\tilde{p}_{data}(x, y)$
- $$\theta_{ML}=\arg\max_{\theta} p_{model}(Y|X, \theta)=\arg\max_{\theta} \prod_{i=1}^n p_{model}(y^{(i)} | x^{(i)}, \theta)$$
- Taking the log and dividing by $n$: $\theta_{ML}=\arg\max_{\theta} E_{x,y \sim \tilde{p}_{data}}[ \log p_{model} (y | x, \theta)]$
- This can then be seen as minimizing a cross-entropy: $-E_{x,y \sim \tilde{p}_{data}}[ \log p_{model} (y | x, \theta)]=H(\tilde{p}_{data}(y|x), p_{model}(y|x, \theta))$, averaged over $x \sim \tilde{p}_{data}$
- Or you can see it from the KL divergence perspective: $\arg\min_{\theta} KL(\tilde{p}_{data}(y|x) || p_{model}(y|x))=\arg\min_{\theta} E_{x,y \sim \tilde{p}_{data}} [\log \tilde{p}_{data}(y|x) - \log p_{model}(y|x) ]=\arg\min_{\theta} H(\tilde{p}_{data}(y|x), p_{model}(y|x, \theta))$, since the $\log \tilde{p}_{data}(y|x)$ term does not depend on $\theta$
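For a binary classifier, the conditional view can be sketched as follows. The tiny dataset and the logistic-regression parameters `w` and `b` are hypothetical; the point is only that the average negative conditional log-likelihood and the familiar binary cross-entropy formula are the same quantity.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Made-up (x, y) pairs with binary labels, and hypothetical parameters theta = (w, b).
xs = [-2.0, -1.0, 0.5, 1.5, 3.0]
ys = [0, 0, 1, 0, 1]
w, b = 1.3, -0.4

def p_model(y, x):
    """Bernoulli conditional model: p(y=1 | x) = sigmoid(w*x + b)."""
    s = sigmoid(w * x + b)
    return s if y == 1 else 1.0 - s

n = len(xs)

# -E_{(x,y) ~ p_tilde} [ log p_model(y | x, theta) ], i.e. the average NLL.
avg_nll = -sum(math.log(p_model(y, x)) for x, y in zip(xs, ys)) / n

# The usual "binary cross-entropy loss" written out term by term.
bce = -sum(
    y * math.log(sigmoid(w * x + b)) + (1 - y) * math.log(1.0 - sigmoid(w * x + b))
    for x, y in zip(xs, ys)
) / n

print(avg_nll, bce)  # equal up to floating-point rounding
```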
References:
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
- Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press.