Modeling a probability distribution has broad applications in image generation, video interpolation, file compression, and so on. In this article we will talk about an interesting method to model a probability distribution called the normalizing flow. In a nutshell, a normalizing flow models a complex probability distribution by learning an invertible deformation between its random variable and one with a simple, predefined prior distribution with independent components, such as a Gaussian. Given a random variable $\boldsymbol{X}\in \mathbb{R}^D$ from an unknown complex distribution, a normalizing flow models the distribution by training an invertible parametric transformation $f$ that maps $\boldsymbol{X}$ to $\boldsymbol{Z}$. During inference, to get a sample $\boldsymbol{x}$, we simply sample $\boldsymbol{z}$ and transform it by $f^{-1}$. To be more precise, let $\boldsymbol{x}, \boldsymbol{z}$ be samples from $\boldsymbol{X}$ and $\boldsymbol{Z}$; by the change of variables theorem:
$$p_X(\boldsymbol{x}) = p_Z(f(\boldsymbol{x}))\left|\text{det}\frac{\partial f(\boldsymbol{x})}{\partial \boldsymbol{x}}\right|$$
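To make the formula concrete, here is a quick one-dimensional sanity check (my own toy example, not from the NICE paper): if $Z\sim\mathcal{N}(0,1)$ and $X=e^Z$, then $f(x)=\log x$ maps $X$ back to $Z$, and the formula above should reproduce the log-normal density.

```python
import torch

x = torch.tensor(2.0)
z = torch.log(x)                           # f(x) = log(x) maps X back to Z
normal = torch.distributions.Normal(0.0, 1.0)
p_z = normal.log_prob(z).exp()             # p_Z(f(x))
jac = (1.0 / x).abs()                      # |det df/dx| = 1/x in one dimension
p_x = p_z * jac                            # change of variables formula

lognormal = torch.distributions.LogNormal(0.0, 1.0)
print(p_x.item(), lognormal.log_prob(x).exp().item())   # the two densities agree
```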
This way it is easy to sample $\boldsymbol{x}$ using the following steps: $\begin{aligned}\boldsymbol{z}&\sim p_Z(\boldsymbol{z})\\ \boldsymbol{x}&=f^{-1}(\boldsymbol{z})\end{aligned}$ To increase the expressiveness of the model, $f$ is designed to be a composition of invertible functions $f=f_S\circ f_n\circ f_{n-1}\circ \dots\circ f_1$, and we let $\boldsymbol{y}_0=\boldsymbol{x}, \boldsymbol{y}_i=f_i(\boldsymbol{y}_{i-1})$. Here $f_S$ is a special layer that gives relative importance to each element of the output; we will talk more about it in the next section. By the change of variables rule we have
$$p_X(\boldsymbol{x}) = p_{Y_n}(\boldsymbol{y}_n)\left|\text{det}\frac{\partial \boldsymbol{y}_n}{\partial \boldsymbol{x}}\right|=p_{Y_n}(\boldsymbol{y}_n)\prod_{i=1}^{n}\left|\text{det}\frac{\partial \boldsymbol{y}_i}{\partial\boldsymbol{y}_{i-1}}\right|$$
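As a minimal sketch of the two-step sampling recipe above (assuming PyTorch, and using a toy hand-picked invertible map $f(x)=(x-\mu)/\sigma$ for illustration only; a real flow would use the learned layers described below):

```python
import torch

mu, sigma = 3.0, 2.0
prior = torch.distributions.Normal(0.0, 1.0)   # simple predefined prior p_Z

z = prior.sample((10_000,))                    # step 1: z ~ p_Z(z)
x = sigma * z + mu                             # step 2: x = f^{-1}(z)

print(x.mean().item(), x.std().item())         # roughly 3.0 and 2.0, i.e. N(mu, sigma^2)
```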
The function $f_i$ maps $\boldsymbol{y}_{i-1}$ to $\boldsymbol{y}_i$. The coupling layer, proposed in the NICE paper, is one choice of $f_i$. It splits the input $\boldsymbol{x}$ into two partitions $\boldsymbol{x}_{A}\in\mathbb{R}^{d}$ and $\boldsymbol{x}_{B}\in\mathbb{R}^{D-d}$ and maps them as follows:
$$\begin{aligned}\boldsymbol{y}_{A}&= \boldsymbol{x}_{A}\\ \boldsymbol{y}_{B}&= g(\boldsymbol{x}_{B};m(\boldsymbol{x}_{A}))\end{aligned}$$
The mapping above is called a coupling layer, where $m$ is an arbitrary function; in practice $m$ can be anything, such as an artificial neural network.
It is easy to show that
$$\frac{\partial \boldsymbol{y}}{\partial \boldsymbol{x}} = \begin{bmatrix}\boldsymbol{I}_d & \boldsymbol{0}_{d\times (D-d)} \\ \frac{\partial{\boldsymbol{y}_{B}}}{\partial \boldsymbol{x}_{A}} & \frac{\partial \boldsymbol{y}_{B}}{\partial \boldsymbol{x}_{B}}\end{bmatrix}$$
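The block structure of this Jacobian is easy to verify numerically. Below is a small sketch (assuming PyTorch; the MLP standing in for $m$ and the particular invertible choice of $g$ are illustrative, not from the paper) that checks that the upper-right block $\frac{\partial \boldsymbol{y}_A}{\partial \boldsymbol{x}_B}$ is zero.

```python
import torch

D, d = 6, 3
# an arbitrary stand-in for m: a small MLP from the first partition to the second
m = torch.nn.Sequential(torch.nn.Linear(d, 16), torch.nn.Tanh(), torch.nn.Linear(16, D - d))

def coupling(x):
    x_a, x_b = x[:d], x[d:]
    y_a = x_a                           # y_A = x_A (identity on the first partition)
    y_b = x_b * torch.exp(m(x_a))       # one invertible choice of g(x_B; m(x_A))
    return torch.cat([y_a, y_b])

x = torch.randn(D)
J = torch.autograd.functional.jacobian(coupling, x)   # full D x D Jacobian
print(J[:d, d:])    # dy_A/dx_B: the zero block in the upper right
print(J[d:, d:])    # dy_B/dx_B: diagonal here, arbitrary in general
```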
The additive coupling layer is the coupling layer with
$$g(\boldsymbol{x}_{B};m(\boldsymbol{x}_{A})) = \boldsymbol{x}_{B}+ m(\boldsymbol{x}_{A})$$
Since $\frac{\partial \boldsymbol{y}_{B}}{\partial \boldsymbol{x}_{B}}=\frac{\partial (\boldsymbol{x}_{B}+ m(\boldsymbol{x}_{A}))}{\partial \boldsymbol{x}_B}=\boldsymbol{I}_{D-d}$, the Jacobian becomes
$$\frac{\partial \boldsymbol{y}}{\partial \boldsymbol{x}}= \begin{bmatrix}\boldsymbol{I}_d & \boldsymbol{0}_{d\times (D-d)} \\ \frac{\partial{\boldsymbol{y}_{B}}}{\partial \boldsymbol{x}_{A}} & \boldsymbol{I}_{D-d} \end{bmatrix}$$
This is a lower triangular matrix, so we can easily compute the absolute Jacobian determinant $J=\left| \text{det}\frac{\partial \boldsymbol{y}}{\partial \boldsymbol{x}} \right|$. Remember from linear algebra that the determinant of a lower triangular matrix is the product of its diagonal entries (if you have forgotten this, please revisit your linear algebra book), and since all of our diagonal entries equal $1$, we get $J=1$. Applying this recursively to every coupling layer, it is easy to see that $p_X(\boldsymbol{x}) = p_{Y_{n}}(\boldsymbol{y}_n)$.
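Here is a minimal additive coupling layer written as a PyTorch module (a sketch; the layer sizes and the MLP used for $m$ are my own illustrative choices, not the NICE paper's architecture). Its forward pass contributes nothing to the log-determinant, and its inverse simply subtracts $m(\boldsymbol{x}_A)$ back.

```python
import torch

class AdditiveCoupling(torch.nn.Module):
    def __init__(self, d, D):
        super().__init__()
        self.d = d
        self.m = torch.nn.Sequential(
            torch.nn.Linear(d, 64), torch.nn.ReLU(), torch.nn.Linear(64, D - d)
        )

    def forward(self, x):                 # x: (batch, D)
        x_a, x_b = x[:, : self.d], x[:, self.d :]
        return torch.cat([x_a, x_b + self.m(x_a)], dim=1)   # y_A = x_A, y_B = x_B + m(x_A)

    def inverse(self, y):
        y_a, y_b = y[:, : self.d], y[:, self.d :]
        return torch.cat([y_a, y_b - self.m(y_a)], dim=1)   # x_B = y_B - m(y_A)

layer = AdditiveCoupling(d=2, D=4)
x = torch.randn(8, 4)
print(torch.allclose(layer.inverse(layer(x)), x))           # True: the layer is invertible
```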
Since one partition of the coupling layer's input is left unchanged (mapped by the identity function), when composing coupling layers we should swap the roles of the two input partitions between layers. To add more expressiveness to the model, on top of the last coupling layer we multiply $\boldsymbol{y}_n$ by a learnable diagonal matrix $\boldsymbol{S}\in \mathbb{R}^{D\times D}$; this allows the model to control the relative importance of each element of $\boldsymbol{y}_n$, so the final layer becomes $\boldsymbol{z}=\boldsymbol{S}\boldsymbol{y}_n$. From vector calculus it is easy to show that $\frac{\partial \boldsymbol{z}}{\partial \boldsymbol{y}_n}=\boldsymbol{S}$. Since the coupling layers contribute a determinant of $1$, we get $\text{det}\frac{\partial \boldsymbol{z}}{\partial \boldsymbol{x}}=\text{det}\frac{\partial \boldsymbol{z}}{\partial \boldsymbol{y}_n}\,\text{det}\frac{\partial \boldsymbol{y}_n}{\partial \boldsymbol{x}}=\text{det}(\boldsymbol{S})\cdot 1=\text{det}\boldsymbol{S}$, so the final equation is as follows: $$p_X(\boldsymbol{x}) = p_{Z}(\boldsymbol{z})\left|\text{det}\frac{\partial \boldsymbol{z}}{\partial \boldsymbol{x}}\right|=p_{Z}(\boldsymbol{z})\left|\text{det}\boldsymbol{S}\right|$$
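The scaling layer can be sketched the same way. Parameterizing the diagonal through $\log|s_{ii}|$ is my own choice here (it keeps every diagonal entry nonzero, hence $\boldsymbol{S}$ invertible, and makes $\text{log}\left|\text{det}\boldsymbol{S}\right|$ a simple sum):

```python
import torch

class Scaling(torch.nn.Module):
    def __init__(self, D):
        super().__init__()
        self.log_s = torch.nn.Parameter(torch.zeros(D))   # log|s_ii|, initialised at 0

    def forward(self, y):                                 # y: (batch, D)
        z = y * torch.exp(self.log_s)                     # multiply by the diagonal S
        log_det = self.log_s.sum()                        # log|det S| = sum_i log|s_ii|
        return z, log_det

    def inverse(self, z):
        return z * torch.exp(-self.log_s)
```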
The loss function is simply the log likelihood of our dataset under the model, $\text{log} (p_X(\boldsymbol{x}))$, which we maximize; and since the prior distribution $p_Z(\boldsymbol{z})$ has independent components, it can be factorized into:
$$\begin{aligned}\text{log} (p_X(\boldsymbol{x}))&= \text{log}(p_Z(\boldsymbol{z}))+\text{log}\left|\text{det}\boldsymbol{S}\right|\\&=\sum_{i=1}^{D}\text{log}(p_{Z_i}(z_i))+\sum_{i=1}^{D}\text{log}\left|S_{ii}\right|\\&=\sum_{i=1}^{D}\left[ \text{log}(p_{Z_i}(z_i))+\text{log}\left|S_{ii}\right|\right] \end{aligned}$$
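Putting the pieces together, here is a sketch of a training loop that maximizes this factorized log likelihood, reusing the AdditiveCoupling and Scaling sketches above. The toy dataset, layer count, standard Gaussian prior, and the trick of reversing the feature order between coupling layers (a fixed permutation, so it swaps the partition roles without changing the determinant) are all illustrative assumptions:

```python
import torch

D, d = 4, 2
couplings = torch.nn.ModuleList([AdditiveCoupling(d, D) for _ in range(4)])
scaling = Scaling(D)
prior = torch.distributions.Normal(0.0, 1.0)
params = list(couplings.parameters()) + list(scaling.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

x = torch.randn(256, D) @ torch.randn(D, D)      # toy stand-in for the "unknown" dataset

for step in range(1000):
    y = x
    for layer in couplings:
        y = layer(y)
        y = y.flip(dims=[1])                     # swap partition roles for the next layer
    z, log_det_s = scaling(y)
    # log p_X(x) = sum_i [ log p_{Z_i}(z_i) + log|S_ii| ]
    log_px = prior.log_prob(z).sum(dim=1) + log_det_s
    loss = -log_px.mean()                        # maximize likelihood = minimize NLL
    opt.zero_grad()
    loss.backward()
    opt.step()
```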
For corrections or suggestions, please email me at kkrzkrk@gmail.com.
Diagrams and text are licensed under Creative Commons Attribution CC-BY 2.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: “Figure from …”.
For attribution in academic contexts, please cite this work as
Arpiandi, Kiki Rizki, "Transforming Simple Distribution with Normalizing Flow", 2022.
BibTeX citation
@article{kiki2022flow,
  author = {Arpiandi, Kiki Rizki},
  title = {Transforming Simple Distribution with Normalizing Flow},
  year = {2022},
  url = {https://kikirizki.github.io/gan.html}
}