Gumbel Distribution
The Gumbel distribution (Generalized Extreme Value distribution Type-I) is used to model the distribution of the maximum (or the minimum) of a number of samples of various distributions.
The CDF is $F(x; \mu, \beta)=e^{-e^{-(x-\mu) / \beta}}$, and the PDF is
$$
f(x; \mu, \beta)=\frac{1}{\beta} e^{-\left(z+e^{-z}\right)}, \quad \text{where } z=\frac{x-\mu}{\beta}
$$
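To make the extreme-value claim concrete, here is a minimal NumPy sketch (an illustration, not from the references): the maximum of $n$ i.i.d. $\mathrm{Exp}(1)$ samples, shifted by $\ln n$, closely follows the standard Gumbel distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 500, 20_000

# The maximum of n i.i.d. Exp(1) samples, shifted by log(n),
# converges in distribution to the standard Gumbel.
maxima = rng.exponential(size=(trials, n)).max(axis=1) - np.log(n)

# Compare the empirical CDF with F(x) = exp(-exp(-x)) at a few points.
for x in (-1.0, 0.0, 1.0, 2.0):
    print(f"x={x:+.1f}  empirical={np.mean(maxima <= x):.3f}  "
          f"Gumbel={np.exp(-np.exp(-x)):.3f}")
```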
The Standard Gumbel Distribution is the case when $\mu=0$ and $\beta=1$. In this case, the CDF is $F(x)=e^{-e^{-x}}$ and the PDF is $f(x)=e^{-\left(x+e^{-x}\right)}$.
Since the quantile function (inverse cumulative distribution function), $Q(p)$, of a Gumbel distribution is given by
$$Q(p)=\mu-\beta \ln (-\ln (p)),$$
the variate $Q(U)$ has a Gumbel distribution with parameters $\mu$ and $\beta$ when $U \sim \mathrm{Uniform}(0,1)$.
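This gives a direct way to sample the distribution via inverse-transform sampling; a minimal NumPy sketch:

```python
import numpy as np

def sample_gumbel(mu=0.0, beta=1.0, size=None, rng=None):
    """Draw Gumbel(mu, beta) samples via the quantile function Q(U)."""
    if rng is None:
        rng = np.random.default_rng()
    u = rng.uniform(size=size)             # U ~ Uniform(0, 1)
    return mu - beta * np.log(-np.log(u))  # Q(u) = mu - beta * ln(-ln(u))

samples = sample_gumbel(size=100_000)
# The standard Gumbel mean is the Euler-Mascheroni constant, ~0.5772.
print(samples.mean())
```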
The Gumbel-Softmax Distribution
Let $z$ be a categorical variable with class probabilities $\pi_1, \pi_2, \ldots, \pi_k$. We assume categorical samples are encoded as $k$-dimensional one-hot vectors lying on the corners of the $(k-1)$-dimensional simplex $\Delta^{k-1}$, so that
$\mathbb{E}_{p}[z]=\left[\pi_{1}, \ldots, \pi_{k}\right]$
The Gumbel-Max trick provides a simple and efficient way of drawing samples $z$ from a categorical distribution with class probabilities $\pi$.
$$
z=\text{one\_hot}\left(\arg \max_i \left[g_i + \log{\pi_i}\right]\right)
$$
where $g_1, \ldots, g_k$ are i.i.d. samples drawn from $\mathrm{Gumbel}(0,1)$.
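A minimal NumPy sketch of the Gumbel-Max trick (illustrative; the seed and probabilities are arbitrary). Empirically, the sampled indices recover the class probabilities $\pi$:

```python
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.5, 0.3, 0.2])  # class probabilities

def gumbel_max_sample(pi, rng):
    g = -np.log(-np.log(rng.uniform(size=pi.shape)))  # g_i ~ Gumbel(0, 1)
    return np.argmax(np.log(pi) + g)                  # index of the one-hot sample

counts = np.bincount([gumbel_max_sample(pi, rng) for _ in range(100_000)],
                     minlength=len(pi))
print(counts / counts.sum())  # ~ [0.5, 0.3, 0.2]
```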
We then use the softmax as a continuous, differentiable approximation to $\arg\max$, obtaining $y\in \Delta^{k-1}$ with components
$$y_{i}=\frac{\exp \left(\left(\log \left(\pi_{i}\right)+g_{i}\right) / \tau\right)}{\sum_{j=1}^{k} \exp \left(\left(\log \left(\pi_{j}\right)+g_{j}\right) / \tau\right)} \quad$$ for $i=1, \ldots, k$
$\tau$ is the softmax temperature, and the Gumbel-Softmax distribution has density (for a derivation, see [1]):
$$p_{\pi, \tau}\left(y_{1}, \ldots, y_{k}\right)=\Gamma(k) \tau^{k-1}\left(\sum_{i=1}^{k} \pi_{i} / y_{i}^{\tau}\right)^{-k} \prod_{i=1}^{k}\left(\pi_{i} / y_{i}^{\tau+1}\right)$$
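A NumPy sketch of drawing a relaxed sample $y$ (again illustrative): lowering the temperature $\tau$ pushes $y$ toward a one-hot vertex of the simplex, while raising it pushes $y$ toward uniform.

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax_sample(pi, tau, rng):
    g = -np.log(-np.log(rng.uniform(size=pi.shape)))  # g_i ~ Gumbel(0, 1)
    logits = (np.log(pi) + g) / tau
    logits -= logits.max()        # subtract max for numerical stability
    y = np.exp(logits)
    return y / y.sum()            # y lies on the simplex Delta^{k-1}

pi = np.array([0.5, 0.3, 0.2])
for tau in (5.0, 1.0, 0.1):
    print(tau, gumbel_softmax_sample(pi, tau, rng).round(3))
# Large tau -> near-uniform y; small tau -> near one-hot y.
```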
Gumbel-Softmax Estimator
The procedure of replacing non-differentiable categorical samples with a differentiable approximation during training is known as the Gumbel-Softmax estimator. Notice there is a trade-off: at small temperatures, samples are close to one-hot but the gradients have high variance; at large temperatures, samples are smooth (closer to uniform) but the gradients have low variance.
The authors suggest annealing the temperature $\tau$ according to a schedule; if $\tau$ is instead learned, this can be interpreted as entropy regularization [1].
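As an illustration (the exact schedule and constants here are assumptions, not prescribed by [1]), one simple choice is to decay $\tau$ exponentially with the training step toward a floor:

```python
import numpy as np

def anneal_tau(step, tau_min=0.5, rate=1e-5):
    """Exponentially decay the temperature toward a floor tau_min (hypothetical schedule)."""
    return max(tau_min, float(np.exp(-rate * step)))

print([anneal_tau(s) for s in (0, 10_000, 100_000)])  # 1.0 -> ~0.905 -> 0.5
```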
For scenarios in which we are constrained to sampling discrete values, we discretize $y$ using $\arg\max$ in the forward pass but use the continuous approximation in the backward pass by letting $\nabla_\theta z \approx \nabla_\theta y$. This is the Straight-Through (ST) Gumbel estimator.
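A minimal PyTorch sketch of the ST estimator (assuming PyTorch is available; PyTorch also ships `torch.nn.functional.gumbel_softmax(..., hard=True)` with the same behavior): the forward pass emits the discrete one-hot $z$, while gradients flow through the soft sample $y$.

```python
import torch
import torch.nn.functional as F

def st_gumbel_softmax(logits, tau=1.0):
    g = -torch.log(-torch.log(torch.rand_like(logits)))  # g_i ~ Gumbel(0, 1)
    y = F.softmax((logits + g) / tau, dim=-1)            # soft sample, differentiable
    index = y.argmax(dim=-1, keepdim=True)
    z = torch.zeros_like(y).scatter_(-1, index, 1.0)     # hard one-hot, non-differentiable
    # Straight-through: forward value is z, backward gradient is that of y.
    return z + (y - y.detach())

logits = torch.log(torch.tensor([[0.5, 0.3, 0.2]]))
print(st_gumbel_softmax(logits, tau=0.5))  # a one-hot row vector
```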
Related Methods:
See the figure in [1] comparing gradient-estimation strategies on stochastic computation graphs.
(a): If $x(\theta)$ is deterministic and differentiable, $\nabla_\theta f(x)$ can be computed directly via backpropagation.
(b):
Ref:
- [1] Jang, E., Gu, S., & Poole, B. (2017). Categorical Reparameterization with Gumbel-Softmax. ICLR 2017.
- [2] Wikipedia contributors. (2019, August 12). Gumbel distribution. Wikipedia. https://en.wikipedia.org/wiki/Gumbel_distribution