**Gumbel Distribution**

**The Gumbel distribution (Generalized Extreme Value distribution Type-I) is used to model the distribution of the maximum (or the minimum) of a number of samples of various distributions.**

The CDF is $F(x ; \mu, \beta)=e^{-e^{-(x-\mu) / \beta}}$

the PDF is

$$

\begin{array}{l}{\frac{1}{\beta} e^{-\left(z+e^{-z}\right)}} \\ {\text { where } z=\frac{x-\mu}{\beta}}\end{array}

$$

The

**Standard Gumbel Distribution**is the case when $\mu=0$ and $\beta=1$

In this case, the CDF is $F(x)= e^{-e^{(-x)}}$

the PDF is $f(x)=e^{-(x+e^{-x})}$

Since the quantile function(inverse cumulative distribution function), $Q(p)$, of a Gumbel distribution is given by

$$Q(p)=\mu-\beta \ln (-\ln (p))$$

the variate $Q(U)$ has a Gumbel distribution with parameters $\mu$ and $\beta$, when the random variate $U \sim Uniform(0,1)$

**The Gumbel Softmax Distribution**

**Let $z$ be a categorical variable with class probabilities $\pi_1, \pi_2, ..., \pi_k$. And we assume categorical samples are encoded as k-dimensional one-hot vectors lying on the corners of the (k-1)-dimensional simplex, $\Delta^{k-1}$.**

$\mathbb{E}_{p}[z]=\left[\pi_{1}, \ldots, \pi_{k}\right]$

The Gumbel-Max trick provides a simple and efficient way of drawing samples $z$ from a categorical distribution with class probabilities $\pi$.

$$

z=\textrm{one_hot}(\arg \max_i [g_i + \log{\pi_i}])

$$

where $g_1, ..., g_k $ are i.i.d samples drawn from Gumbel(0,1). Then we use softmax as a continuous differentiation approximation to $\arg\max$.

$y\in \Delta^{k-1}$ where

$$y_{i}=\frac{\exp \left(\left(\log \left(\pi_{i}\right)+g_{i}\right) / \tau\right)}{\sum_{j=1}^{k} \exp \left(\left(\log \left(\pi_{j}\right)+g_{j}\right) / \tau\right)} \quad$$ for $i=1, \ldots, k$

$\tau$ is the softmax temperature, and the Gumbel-softmax distribution density is (for derivation, see[1]):

$$p_{\pi, \tau}\left(y_{1}, \ldots, y_{k}\right)=\Gamma(k) \tau^{k-1}\left(\sum_{i=1}^{k} \pi_{i} / y_{i}^{\tau}\right)^{-k} \prod_{i=1}^{k}\left(\pi_{i} / y_{i}^{\tau+1}\right)$$

**Gumbel-Softmax Estimator**

The procedure of replacing non-differentiable categorical samples with a differentiable apprxoimatin during training as the

*Gumbel-Softmax estimator*. Notice there is a trade-off between small temperatures where samples are close to one-hot but gradients have large variances, and large temperatures where samples are smooth(uniform-like) but variance of gradients are small.

The author suggests a annealing scheme(entropy regularization) to learn this parameter $\tau$

For scenarios in which we are constrained to sampling discrete values, we discretize $y$ using

*argmax*but use continuous approximation in the backward pass by let $\nabla_\theta z \approx \nabla_\theta y$. This is the Straight-Through(ST) Gumbel Estimator

**Related Methods:**

(a): If $x(\theta)$ is deterministic and differentiable, $\nabla_\theta f(x)$ can be computed via backprop directly

(b):

Ref:

- Jang, E., Brain, G., Gu, S., & Poole, B. (n.d.). Published as a conference paper at ICLR 2017 CATEGORICAL REPARAMETERIZATION WITH GUMBEL-SOFTMAX.
- Wikipedia Contributors. (2019, August 12). Gumbel distribution. Wikipedia. Wikimedia Foundation. https://en.wikipedia.org/wiki/Gumbel_distribution.