Tuesday, August 13, 2019

Categorical reparameterization with Gumbel softmax

Gumbel Distribution

The Gumbel distribution (Generalized Extreme Value distribution Type-I) is used to model the distribution of the maximum (or the minimum) of a number of samples of various distributions.

The CDF is $F(x ; \mu, \beta)=e^{-e^{-(x-\mu) / \beta}}$
the PDF is
$$f(x ; \mu, \beta)=\frac{1}{\beta} e^{-\left(z+e^{-z}\right)}, \quad \text{where } z=\frac{x-\mu}{\beta}$$

The Standard Gumbel Distribution is the case when $\mu=0$ and $\beta=1$.
In this case, the CDF is $F(x)= e^{-e^{-x}}$
and the PDF is $f(x)=e^{-(x+e^{-x})}$

Since the quantile function (inverse cumulative distribution function), $Q(p)$, of a Gumbel distribution is given by
$$Q(p)=\mu-\beta \ln (-\ln (p)),$$
the variate $Q(U)$ has a Gumbel distribution with parameters $\mu$ and $\beta$ when $U \sim \mathrm{Uniform}(0,1)$.
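This inverse-transform recipe is easy to check numerically. A minimal NumPy sketch (the function name `sample_gumbel` is mine, not from the paper):

```python
import numpy as np

def sample_gumbel(shape, mu=0.0, beta=1.0, rng=None):
    """Draw Gumbel(mu, beta) samples via Q(p) = mu - beta * ln(-ln p)."""
    if rng is None:
        rng = np.random.default_rng(0)
    u = rng.uniform(size=shape)          # U ~ Uniform(0, 1)
    return mu - beta * np.log(-np.log(u))

g = sample_gumbel((100_000,))
# The standard Gumbel mean is the Euler-Mascheroni constant (~0.5772)
print(g.mean())
```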

The Gumbel Softmax Distribution 

Let $z$ be a categorical variable with class probabilities $\pi_1, \pi_2, \ldots, \pi_k$. We assume categorical samples are encoded as $k$-dimensional one-hot vectors lying on the corners of the $(k-1)$-dimensional simplex $\Delta^{k-1}$, so that
$$\mathbb{E}_{p}[z]=\left[\pi_{1}, \ldots, \pi_{k}\right]$$

The Gumbel-Max trick provides a simple and efficient way of drawing samples $z$ from a categorical distribution with class probabilities $\pi$:
$$z=\operatorname{one\_hot}\left(\arg \max_i \left[g_i + \log \pi_i\right]\right)$$
where $g_1, \ldots, g_k$ are i.i.d. samples drawn from $\mathrm{Gumbel}(0,1)$. The softmax is then used as a continuous, differentiable approximation to $\arg\max$.
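The Gumbel-Max trick can be sketched in a few lines of NumPy; the empirical class frequencies should recover $\pi$ (the function name below is mine):

```python
import numpy as np

def gumbel_max_sample(log_pi, rng):
    """Draw a one-hot categorical sample via the Gumbel-Max trick."""
    g = -np.log(-np.log(rng.uniform(size=log_pi.shape)))  # Gumbel(0, 1) noise
    z = np.zeros_like(log_pi)
    z[np.argmax(log_pi + g)] = 1.0
    return z

rng = np.random.default_rng(0)
pi = np.array([0.5, 0.3, 0.2])
counts = sum(gumbel_max_sample(np.log(pi), rng) for _ in range(20_000))
print(counts / 20_000)   # empirical frequencies approach pi
```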

$y\in \Delta^{k-1}$ where

$$y_{i}=\frac{\exp \left(\left(\log \left(\pi_{i}\right)+g_{i}\right) / \tau\right)}{\sum_{j=1}^{k} \exp \left(\left(\log \left(\pi_{j}\right)+g_{j}\right) / \tau\right)} \quad$$ for $i=1, \ldots, k$
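The sampling formula above translates directly into code. A small NumPy sketch (again, the function name is my own):

```python
import numpy as np

def gumbel_softmax_sample(log_pi, tau, rng):
    """Draw y on the simplex: softmax((log pi + g) / tau), g ~ Gumbel(0, 1)."""
    g = -np.log(-np.log(rng.uniform(size=log_pi.shape)))
    x = (log_pi + g) / tau
    e = np.exp(x - x.max())              # shift for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
log_pi = np.log(np.array([0.5, 0.3, 0.2]))
y = gumbel_softmax_sample(log_pi, tau=0.5, rng=rng)
print(y, y.sum())                        # y lies on the simplex
```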

$\tau$ is the softmax temperature, and the Gumbel-Softmax distribution density is (for the derivation, see [1]):

$$p_{\pi, \tau}\left(y_{1}, \ldots, y_{k}\right)=\Gamma(k) \tau^{k-1}\left(\sum_{i=1}^{k} \pi_{i} / y_{i}^{\tau}\right)^{-k} \prod_{i=1}^{k}\left(\pi_{i} / y_{i}^{\tau+1}\right)$$

The Gumbel-Softmax distribution interpolates between discrete one-hot-encoded categorical distributions and continuous categorical densities.
(a) For low temperatures ($\tau=0.1, 0.5$), the expected value of a Gumbel-Softmax random variable approaches the expected value of a categorical random variable with the same logits. As the temperature increases ($\tau=1, 10$), the expected value converges to a uniform distribution over the categories. (b) Samples from Gumbel-Softmax distributions become identical to samples from a categorical distribution as $\tau \to 0$. At higher temperatures, Gumbel-Softmax samples are no longer one-hot, and become uniform as $\tau \to \infty$.
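The temperature effect is easy to see empirically: the average size of the largest component of $y$ is near 1 when samples are nearly one-hot, and near $1/k$ when they are nearly uniform. A quick NumPy check:

```python
import numpy as np

rng = np.random.default_rng(0)
log_pi = np.log(np.array([0.5, 0.3, 0.2]))

def sample(tau):
    g = -np.log(-np.log(rng.uniform(size=3)))   # Gumbel(0, 1) noise
    x = (log_pi + g) / tau
    e = np.exp(x - x.max())
    return e / e.sum()

avg_max = {}
for tau in (0.1, 1.0, 10.0):
    ys = np.stack([sample(tau) for _ in range(5_000)])
    # mean of the largest component: ~1 near one-hot, ~1/3 near uniform
    avg_max[tau] = ys.max(axis=1).mean()
    print(tau, avg_max[tau])
```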

Gumbel-Softmax Estimator

The procedure of replacing non-differentiable categorical samples with a differentiable approximation during training is called the Gumbel-Softmax estimator. Notice the trade-off: at small temperatures, samples are close to one-hot but gradients have large variance; at large temperatures, samples are smooth (uniform-like) but the variance of gradients is small.

The authors suggest an annealing schedule for $\tau$; alternatively, $\tau$ can be learned, which acts as a form of entropy regularization.

For scenarios in which we are constrained to sampling discrete values, we discretize $y$ using $\arg\max$ in the forward pass but use the continuous approximation in the backward pass by letting $\nabla_\theta z \approx \nabla_\theta y$. This is the Straight-Through (ST) Gumbel estimator.
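The forward pass of the ST estimator can be sketched in NumPy; the stop-gradient trick itself only matters inside an autodiff framework, where the last line would use that framework's stop-gradient/detach operation (this is a sketch, not the paper's reference implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
log_pi = np.log(np.array([0.5, 0.3, 0.2]))

# Soft sample y (forward pass of Gumbel-Softmax, tau = 0.5)
g = -np.log(-np.log(rng.uniform(size=3)))
x = (log_pi + g) / 0.5
e = np.exp(x - x.max())
y = e / e.sum()

# Straight-Through: discretize y with argmax for the forward pass...
z = np.zeros_like(y)
z[np.argmax(y)] = 1.0

# ...but route gradients through y. In an autodiff framework this is written
# as  z_st = y + stop_gradient(z - y),  so that grad(z_st) = grad(y) while
# the forward value equals z (up to floating point).
z_st = y + (z - y)
print(z_st)
```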

Related Methods:

See the figure in [1] comparing gradient estimators.
(a): If $x(\theta)$ is deterministic and differentiable, $\nabla_\theta f(x)$ can be computed directly via backpropagation.

  1. Jang, E., Gu, S., & Poole, B. (2017). Categorical Reparameterization with Gumbel-Softmax. ICLR 2017.
  2. Wikipedia Contributors. (2019, August 12). Gumbel distribution. Wikipedia. Wikimedia Foundation. https://en.wikipedia.org/wiki/Gumbel_distribution.