1. Maximum Likelihood

Suppose we observe random variables \(X = \{X_1, X_2, \cdots, X_n\}\) whose distribution depends on an unknown parameter \(\theta\). We want to find a good estimate of \(\theta\).

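The working assumption is that a value \(\hat{\theta}\) that maximizes the probability of the data we actually observed is likely to be close to the unknown parameter we are looking for. Even when it is not exact, it is usually a good approximation.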

Assuming the observations are independent, their observed values define a likelihood function:

\[ \begin{aligned} L(\theta) &= P(X_1=x_1, X_2=x_2, \cdots, X_n = x_n) \\ &= p(x_1|\theta)\cdot p(x_2|\theta)\cdots p(x_n|\theta) \\ &= \prod_{i=1}^n p(x_i|\theta) \end{aligned} \]

Taking the logarithm turns the product into a sum and gives the log-likelihood function:

\[ \begin{aligned} l(\theta) &= \log L(\theta) \\ &= \log \prod_{i=1}^n p(x_i|\theta) \\ &= \sum_{i=1}^n \log p(x_i|\theta) \end{aligned} \]

Estimating the unknown \(\theta\) then amounts to maximizing \(l(\theta)\).

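For example, let \(\theta\) be an unknown parameter, and suppose \(X\) has the distribution below, with each of the four values observed exactly once:

\[ \begin{array}{c|cccc} X & 0 & 1 & 2 & 3 \\ \hline p(X) & \frac{3\theta}{4} & \frac{\theta}{4} & \frac{2(1-\theta)}{3} & \frac{1-\theta}{3} \end{array} \]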

The corresponding likelihood function is

\[ \begin{aligned} L(\theta) &= \frac{3\theta}{4} \cdot \frac{\theta}{4} \cdot \frac{2(1-\theta)}{3} \cdot \frac{1-\theta}{3} \\ &= \frac{1}{24}\theta^2(1-\theta)^2 \end{aligned} \]

The log-likelihood function is

\[ \begin{aligned} l(\theta) &= \log L(\theta) \\ &= \log \frac{1}{24} + 2 \log \theta + 2 \log (1-\theta) \end{aligned} \]

Differentiating with respect to \(\theta\) gives

\[ \frac{d l(\theta)}{d \theta} = \frac{2}{\theta} - \frac{2}{1-\theta} \]

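To maximize \(l(\theta)\), we set the derivative to zero, which gives \(\theta = \frac{1}{2}\). Since \(l\) is concave on \((0, 1)\), this stationary point is the maximum.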

In this way, maximizing the likelihood function gives us an estimate of the unknown parameter.
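The worked example can be checked numerically. The sketch below evaluates the log-likelihood on a grid and picks the largest value. It assumes, as in the example, that each of the four values was observed once; the grid resolution and the names `log_likelihood` and `theta_hat` are illustrative choices, not part of the original derivation.

```python
import math

def log_likelihood(theta: float) -> float:
    """Log-likelihood when each of the values 0, 1, 2, 3 is observed once."""
    probs = [3 * theta / 4, theta / 4, 2 * (1 - theta) / 3, (1 - theta) / 3]
    return sum(math.log(p) for p in probs)

# Grid over the open interval (0, 1); the endpoints would give log(0).
grid = [i / 1000 for i in range(1, 1000)]
theta_hat = max(grid, key=log_likelihood)
print(theta_hat)
```

A grid search is used only because it is easy to read; for this likelihood the closed-form solution from the derivative is available and preferable.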

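The intuition behind maximum likelihood is that if a model can generate data that closely resembles the training set, then what it generates beyond the training set should also resemble the training data.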

2. Independent Random Variables and Convolution

Basic result

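The probability density function of the sum of two independent random variables is the convolution of their individual density functions.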

2.1 Discrete case

Let \(X\) and \(Y\) be two independent discrete random variables with probability mass functions (PMFs) \(p_X(k)\) and \(p_Y(k)\).

Let \(Z = X + Y\). Then the probability mass function of \(Z\) is:

\[ p_Z(n) = \sum_{k=-\infty}^{\infty} p_X(k) \cdot p_Y(n - k) \]

This is exactly the definition of discrete convolution.
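As a small illustration of the discrete formula, the sketch below computes the PMF of the sum of two fair dice. The dice example and the helper name `convolve_pmf` are assumptions made for illustration. The double loop over pairs \((x, y)\) is equivalent to summing \(p_X(k)\, p_Y(n-k)\) over \(k\) for each \(n\).

```python
from fractions import Fraction

def convolve_pmf(p_x: dict, p_y: dict) -> dict:
    """PMF of Z = X + Y for independent X and Y given as {value: probability}."""
    p_z = {}
    for x, px in p_x.items():
        for y, py in p_y.items():
            # Every pair (x, y) with x + y = n contributes p_X(x) * p_Y(y) to p_Z(n).
            p_z[x + y] = p_z.get(x + y, 0) + px * py
    return p_z

die = {k: Fraction(1, 6) for k in range(1, 7)}
two_dice = convolve_pmf(die, die)
print(two_dice[7])  # probability that two fair dice sum to 7
```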

2.2 Continuous case

Let \(X\) and \(Y\) be two independent continuous random variables with probability density functions (PDFs) \(f_X(x)\) and \(f_Y(y)\).

Let \(Z = X + Y\). Then the probability density function of \(Z\) is:

\[ f_Z(z) = \int_{-\infty}^{\infty} f_X(x) \cdot f_Y(z - x)\,dx \]

This is exactly the definition of continuous convolution.
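The continuous formula can be approximated numerically. The sketch below convolves two uniform densities on \([0, 1]\) with a midpoint rule. The choice of uniform densities, the integration bounds, and the step count are assumptions for illustration. The exact result for this case is the triangular density, equal to \(z\) on \([0, 1]\) and \(2 - z\) on \([1, 2]\).

```python
def uniform_pdf(x: float) -> float:
    """Density of the uniform distribution on [0, 1]."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0

def convolve_at(f, g, z: float, lo: float = -1.0, hi: float = 2.0, steps: int = 30000) -> float:
    """Midpoint-rule approximation of the integral of f(x) * g(z - x) over [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        total += f(x) * g(z - x)
    return total * h

for z in (0.5, 1.0, 1.5):
    print(z, convolve_at(uniform_pdf, uniform_pdf, z))
```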

3. The Normal Distribution

Convolution of normal distributions

When two independent random variables follow different normal distributions, we want to know the distribution of their sum.

The probability density function of a one-dimensional normal distribution is:

\[ p(x)= \mathcal{N}(x | \mu,\sigma^2)= \frac{1}{\sqrt{2\pi\sigma^2}} \exp \left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \]

Here \(\mu\in \mathbb{R}\) is the mean of the distribution and \(\sigma > 0\) is the standard deviation:

\[ \begin{aligned} \mu &= \mathbb{E}[x] = \int p(x)xdx \\ \sigma^2 &= \mathbb{E}[(x-\mu)^2] = \int p(x)(x-\mu)^2 dx \end{aligned} \]

Now consider the convolution of the two normal densities \(\mathcal{N}(\cdot | 0, \sigma_1^2)\) and \(\mathcal{N}(\cdot|z, \sigma_2^2)\):

\[ \begin{aligned} &\left(p_\mathcal{N}(\cdot | 0, \sigma_1^2) \ast p_\mathcal{N}(\cdot | z, \sigma_2^2)\right)(x) \\ = &\int p_\mathcal{N}(x-y | 0, \sigma_1^2) p_\mathcal{N}(y | z, \sigma_2^2) dy \\ = &\int \frac{1}{\sqrt{2\pi\sigma_1^2}} \exp \left(-\frac{(x-y)^2}{2\sigma_1^2}\right) \frac{1}{\sqrt{2\pi\sigma_2^2}} \exp \left(-\frac{(y-z)^2}{2\sigma_2^2}\right) dy \\ = &\frac{1}{2\pi\sqrt{\sigma_1^2 \sigma_2^2}} \int \exp \left( -\frac{(x-y)^2}{2\sigma_1^2}-\frac{(y-z)^2}{2\sigma_2^2} \right) dy \\ = &\frac{1}{2\pi\sqrt{\sigma_1^2 \sigma_2^2}} \int \exp \left( -\frac{1}{2\sigma_1^2 \sigma_2^2} \left( \sigma_2^2 (y-x)^2 + \sigma_1^2 (y-z)^2 \right) \right) dy \\ = &\frac{1}{2\pi\sqrt{\sigma_1^2 \sigma_2^2}} \int \exp \left( -\frac{1}{2\sigma_1^2 \sigma_2^2} \left( \sigma_2^2 (y^2 - 2xy + x^2) + \sigma_1^2 (y^2 - 2zy + z^2) \right) \right) dy \\ = &\frac{1}{2\pi\sqrt{\sigma_1^2 \sigma_2^2}} \int \exp \left( -\frac{1}{2\sigma_1^2 \sigma_2^2} \left( (\sigma_1^2 + \sigma_2^2)y^2 - 2(\sigma_2^2 x + \sigma_1^2 z)y + \sigma_2^2 x^2 + \sigma_1^2 z^2 \right) \right) dy \\ = &\frac{1}{2\pi\sqrt{\sigma_1^2 \sigma_2^2}} \int \exp \left( -\frac{1}{2\sigma_1^2 \sigma_2^2} \left( (\sigma_1^2 + \sigma_2^2) \left(y - \frac{\sigma_2^2 x + \sigma_1^2 z}{\sigma_1^2 + \sigma_2^2}\right)^2 - \frac{\left( \sigma_2^2 x + \sigma_1^2 z \right)^2}{\sigma_1^2 + \sigma_2^2} + \sigma_2^2 x^2 + \sigma_1^2 z^2 \right) \right) dy \\ = &\frac{1}{2\pi\sqrt{\sigma_1^2 \sigma_2^2}} \exp \left( -\frac{1}{2\sigma_1^2 \sigma_2^2} \left( - \frac{\left( \sigma_2^2 x + \sigma_1^2 z \right)^2}{\sigma_1^2 + \sigma_2^2} + \sigma_2^2 x^2 + \sigma_1^2 z^2 \right) \right) \int \exp \left( -\frac{\sigma_1^2 + \sigma_2^2}{2\sigma_1^2 \sigma_2^2} \left(y - \frac{\sigma_2^2 x + \sigma_1^2 z}{\sigma_1^2 + \sigma_2^2}\right)^2 \right) dy \\ = &\frac{1}{2\pi\sqrt{\sigma_1^2 \sigma_2^2}} \exp \left( -\frac{1}{2\sigma_1^2 \sigma_2^2} \left( - \frac{\left( \sigma_2^2 x + \sigma_1^2 z \right)^2}{\sigma_1^2 + \sigma_2^2} + 
\sigma_2^2 x^2 + \sigma_1^2 z^2 \right) \right) \sqrt{2\pi \frac{\sigma_1^2 \sigma_2^2}{\sigma_1^2 + \sigma_2^2}} \\ = &\frac{1}{\sqrt{2\pi(\sigma_1^2 + \sigma_2^2)}} \exp \left( -\frac{1}{2\sigma_1^2 \sigma_2^2} \left( - \frac{\left( \sigma_2^2 x + \sigma_1^2 z \right)^2}{\sigma_1^2 + \sigma_2^2} + \sigma_2^2 x^2 + \sigma_1^2 z^2 \right) \right) \\ = &\frac{1}{\sqrt{2\pi(\sigma_1^2 + \sigma_2^2)}} \exp \left( -\frac{1}{2\sigma_1^2 \sigma_2^2(\sigma_1^2 + \sigma_2^2)} \left( - \left( \sigma_2^2 x + \sigma_1^2 z \right)^2 + (\sigma_1^2 + \sigma_2^2)\sigma_2^2 x^2 + (\sigma_1^2 + \sigma_2^2)\sigma_1^2 z^2 \right) \right) \\ = &\frac{1}{\sqrt{2\pi(\sigma_1^2 + \sigma_2^2)}} \exp \left( -\frac{1}{2\sigma_1^2 \sigma_2^2(\sigma_1^2 + \sigma_2^2)} \left( -2 \sigma_1^2\sigma_2^2 xz + \sigma_1^2\sigma_2^2 x^2 + \sigma_1^2\sigma_2^2 z^2 \right) \right) \\ = &\frac{1}{\sqrt{2\pi(\sigma_1^2 + \sigma_2^2)}} \exp \left( -\frac{1}{2(\sigma_1^2 + \sigma_2^2)} \left( -2 xz + x^2 + z^2 \right) \right) \\ = &\frac{1}{\sqrt{2\pi(\sigma_1^2 + \sigma_2^2)}} \exp \left( -\frac{(x-z)^2}{2(\sigma_1^2 + \sigma_2^2)} \right) \\ = &p_\mathcal{N}(x | z, \sigma_1^2 + \sigma_2^2) \end{aligned} \]
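The result of this derivation can be sanity-checked by evaluating the convolution integral numerically at one point and comparing it with the closed form \(p_\mathcal{N}(x | z, \sigma_1^2 + \sigma_2^2)\). The parameter values, integration bounds, and function names below are illustrative assumptions.

```python
import math

def normal_pdf(x: float, mean: float, var: float) -> float:
    """Density of the one-dimensional normal distribution N(mean, var)."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def gaussian_convolution(x, z, var1, var2, lo=-40.0, hi=40.0, steps=80000):
    """Midpoint-rule approximation of the integral of N(x - y | 0, var1) N(y | z, var2) dy."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        y = lo + (i + 0.5) * h
        total += normal_pdf(x - y, 0.0, var1) * normal_pdf(y, z, var2)
    return total * h

x, z, var1, var2 = 1.3, 0.4, 2.0, 0.5
numeric = gaussian_convolution(x, z, var1, var2)
closed_form = normal_pdf(x, z, var1 + var2)
print(numeric, closed_form)
```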

Closure under addition

Now let \(x_1 \sim \mathcal{N}(\mu_1, \sigma_1^2)\) and \(x_2 \sim \mathcal{N}(\mu_2, \sigma_2^2)\) be independent. If \(x = x_1 + x_2\), computing its density by convolution gives

\[ \begin{aligned} p(x) &= \int p_\mathcal{N}(x_1|\mu_1, \sigma_1^2)p_\mathcal{N}(x-x_1|\mu_2, \sigma_2^2) dx_1 \\ &= \int p_\mathcal{N}(x-(x - x_1+\mu_1)|0, \sigma_1^2)p_\mathcal{N}(x-x_1+\mu_1|\mu_1+\mu_2, \sigma_2^2) dx_1 \\ &= \int p_\mathcal{N}(x - y|0, \sigma_1^2)p_\mathcal{N}(y|\mu_1+\mu_2, \sigma_2^2) d y \\ &= p_\mathcal{N}(x|\mu_1+\mu_2, \sigma_1^2+\sigma_2^2) \end{aligned} \]

That is, the sum of two independent normally distributed random variables is again normally distributed:

\[ x_1+x_2 \sim \mathcal{N}(\mu_1+\mu_2, \sigma_1^2+\sigma_2^2) \]
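A quick simulation gives an empirical view of the same statement. The sketch below draws independent samples, adds them, and compares the sample mean and variance with \(\mu_1 + \mu_2\) and \(\sigma_1^2 + \sigma_2^2\). The parameter values, sample size, and seed are arbitrary assumptions. Note that `random.gauss` takes the standard deviation, not the variance.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable
mu1, sigma1 = 1.0, 1.5   # illustrative values, not taken from the text
mu2, sigma2 = -2.0, 0.5
n = 200_000

# Independent draws of x1 and x2, added sample by sample.
samples = [random.gauss(mu1, sigma1) + random.gauss(mu2, sigma2) for _ in range(n)]
sample_mean = sum(samples) / n
sample_var = sum((s - sample_mean) ** 2 for s in samples) / n

print(sample_mean, mu1 + mu2)
print(sample_var, sigma1 ** 2 + sigma2 ** 2)
```

Matching the first two moments does not by itself show that the sum is normal; the convolution derivation is what establishes the shape of the distribution.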

The same conclusion holds in higher dimensions.

First we compute \(p_\mathcal{N}(\cdot|0, \Sigma_1) \ast p_\mathcal{N}(\cdot|z, \Sigma_2)\):

\[ \begin{aligned} &\left(p_\mathcal{N}(\cdot | 0, \Sigma_1) \ast p_\mathcal{N}(\cdot | z, \Sigma_2)\right)(x) \\ = &\int p_\mathcal{N}(x-y | 0, \Sigma_1) p_\mathcal{N}(y | z, \Sigma_2) dy \\ = &\int \frac{1}{(2\pi)^{N/2}|\Sigma_1|^{1/2}} \exp \left(-\frac{1}{2}(x-y)^\top\Sigma_1^{-1}(x-y)\right) \frac{1}{(2\pi)^{N/2}|\Sigma_2|^{1/2}} \exp \left(-\frac{1}{2}(y-z)^\top\Sigma_2^{-1}(y-z)\right) dy \\ = &\frac{1}{(2\pi)^N|\Sigma_1 \Sigma_2|^{1/2}}\int \exp \left(-\frac{1}{2}\left( (x-y)^\top\Sigma_1^{-1}(x-y) + (y-z)^\top\Sigma_2^{-1}(y-z) \right)\right) dy \\ = &\frac{1}{(2\pi)^N|\Sigma_1 \Sigma_2|^{1/2}}\int \exp \left(-\frac{1}{2}\left( x^\top\Sigma_1^{-1} x - x^\top\Sigma_1^{-1} y - y^\top\Sigma_1^{-1} x + y^\top\Sigma_1^{-1} y + y^\top\Sigma_2^{-1} y - y^\top\Sigma_2^{-1} z - z^\top\Sigma_2^{-1} y + z^\top\Sigma_2^{-1} z \right)\right) dy \\ = &\frac{1}{(2\pi)^N|\Sigma_1 \Sigma_2|^{1/2}}\int \exp \left(-\frac{1}{2}\left( y^\top(\Sigma_1^{-1}+\Sigma_2^{-1})y - (x^\top\Sigma_1^{-1}+z^\top\Sigma_2^{-1})y - y^\top(\Sigma_1^{-1} x+\Sigma_2^{-1} z) + x^\top\Sigma_1^{-1} x + z^\top\Sigma_2^{-1} z \right)\right) dy \\ = &\frac{1}{(2\pi)^N|\Sigma_1 \Sigma_2|^{1/2}}\int \exp \left(-\frac{1}{2}\left( y^\top(\Sigma_1^{-1}+\Sigma_2^{-1})y - (\Sigma_1^{-1} x+\Sigma_2^{-1} z)^\top y - y^\top(\Sigma_1^{-1} x+\Sigma_2^{-1} z) + x^\top\Sigma_1^{-1} x + z^\top\Sigma_2^{-1} z \right)\right) dy \\ = &\frac{1}{(2\pi)^N|\Sigma_1 \Sigma_2|^{1/2}}\int \exp \left(-\frac{1}{2}\left( \left( y - (\Sigma_1^{-1}+\Sigma_2^{-1})^{-1}(\Sigma_1^{-1} x+\Sigma_2^{-1} z) \right)^\top(\Sigma_1^{-1}+\Sigma_2^{-1})\left( y - (\Sigma_1^{-1}+\Sigma_2^{-1})^{-1}(\Sigma_1^{-1} x+\Sigma_2^{-1} z) \right) \right. \right. \\ &\hspace{125pt} \left. \left. - \left( (\Sigma_1^{-1}+\Sigma_2^{-1})^{-1}(\Sigma_1^{-1} x+\Sigma_2^{-1} z) \right)^\top(\Sigma_1^{-1}+\Sigma_2^{-1})(\Sigma_1^{-1}+\Sigma_2^{-1})^{-1}(\Sigma_1^{-1} x+\Sigma_2^{-1} z) \right. \right. \\ &\hspace{125pt} \left. \left. 
+ x^\top\Sigma_1^{-1} x + z^\top\Sigma_2^{-1} z \right)\right) dy \\ = &\frac{(2\pi)^{N/2}|(\Sigma_1^{-1}+\Sigma_2^{-1})^{-1}|^{1/2}}{(2\pi)^N|\Sigma_1 \Sigma_2|^{1/2}} \exp \left(-\frac{1}{2}\left( x^\top\Sigma_1^{-1} x + z^\top\Sigma_2^{-1} z - \left( (\Sigma_1^{-1}+\Sigma_2^{-1})^{-1}(\Sigma_1^{-1} x+\Sigma_2^{-1} z) \right)^\top(\Sigma_1^{-1} x+\Sigma_2^{-1} z) \right)\right) \\ \end{aligned} \]

Let \(\Sigma_3 := \Sigma_1 + \Sigma_2\). Then

\[ \begin{aligned} \Sigma_1 (\Sigma_1^{-1}+\Sigma_2^{-1}) \Sigma_2 &= I \Sigma_2 + \Sigma_1 I \\ &= \Sigma_3 \end{aligned} \]
\[ \begin{aligned} (\Sigma_1^{-1}+\Sigma_2^{-1})^{-1} &= (\Sigma_1^{-1} \Sigma_3 \Sigma_2^{-1})^{-1} \\ &= \Sigma_2 \Sigma_3^{-1} \Sigma_1 \end{aligned} \]

Therefore,

\[ \begin{aligned} &\left(p_\mathcal{N}(\cdot | 0, \Sigma_1) \ast p_\mathcal{N}(\cdot | z, \Sigma_2)\right)(x) \\ = &\frac{(2\pi)^{N/2}|(\Sigma_1^{-1}+\Sigma_2^{-1})^{-1}|^{1/2}}{(2\pi)^N|\Sigma_1 \Sigma_2|^{1/2}} \exp \left(-\frac{1}{2}\left( x^\top\Sigma_1^{-1} x + z^\top\Sigma_2^{-1} z - \left( (\Sigma_1^{-1}+\Sigma_2^{-1})^{-1}(\Sigma_1^{-1} x+\Sigma_2^{-1} z) \right)^\top(\Sigma_1^{-1} x+\Sigma_2^{-1} z) \right)\right) \\ = &\frac{(2\pi)^{N/2}|\Sigma_2 \Sigma_3^{-1} \Sigma_1|^{1/2}}{(2\pi)^N|\Sigma_1 \Sigma_2|^{1/2}} \exp \left(-\frac{1}{2}\left( x^\top\Sigma_1^{-1} x + z^\top\Sigma_2^{-1} z - \left( \Sigma_2 \Sigma_3^{-1} \Sigma_1(\Sigma_1^{-1} x+\Sigma_2^{-1} z) \right)^\top(\Sigma_1^{-1} x+\Sigma_2^{-1} z) \right)\right) \\ = &\frac{1}{(2\pi)^{N/2}|\Sigma_3|^{1/2}} \exp \left(-\frac{1}{2}\left( x^\top\Sigma_1^{-1} x + z^\top\Sigma_2^{-1} z - \left( \Sigma_1(\Sigma_1^{-1} x+\Sigma_2^{-1} z) \right)^\top\Sigma_3^{-1}\Sigma_2(\Sigma_1^{-1} x+\Sigma_2^{-1} z) \right)\right) \\ = &\frac{1}{(2\pi)^{N/2}|\Sigma_3|^{1/2}} \exp \left(-\frac{1}{2}\left( x^\top\Sigma_1^{-1} x + z^\top\Sigma_2^{-1} z - \left( x+\Sigma_1\Sigma_2^{-1} z\right)^\top\Sigma_3^{-1}(\Sigma_2\Sigma_1^{-1} x+z) \right)\right) \\ = &\frac{1}{(2\pi)^{N/2}|\Sigma_3|^{1/2}} \exp \left(-\frac{1}{2}\left( x^\top\Sigma_1^{-1} x + z^\top\Sigma_2^{-1} z - x^\top\Sigma_3^{-1}\Sigma_2\Sigma_1^{-1}x - x^\top\Sigma_3^{-1}z - z^\top\Sigma_2^{-1}\Sigma_1\Sigma_3^{-1}\Sigma_2\Sigma_1^{-1}x - z^\top\Sigma_2^{-1}\Sigma_1\Sigma_3^{-1}z \right)\right) \\ = &\frac{1}{(2\pi)^{N/2}|\Sigma_3|^{1/2}} \exp \left(-\frac{1}{2}\left( x^\top(\Sigma_1^{-1}-\Sigma_3^{-1}\Sigma_2\Sigma_1^{-1})x - x^\top\Sigma_3^{-1}z - z^\top\Sigma_2^{-1}\Sigma_1\Sigma_3^{-1}\Sigma_2\Sigma_1^{-1}x + z^\top(\Sigma_2^{-1}-\Sigma_2^{-1}\Sigma_1\Sigma_3^{-1})z \right)\right) \\ \end{aligned} \]

where

\[ \begin{aligned} \Sigma_1^{-1}-\Sigma_3^{-1}\Sigma_2\Sigma_1^{-1} &= \Sigma_1^{-1}-\Sigma_3^{-1}(\Sigma_3 - \Sigma_1)\Sigma_1^{-1} \\ &= \Sigma_1^{-1}-\Sigma_3^{-1}\Sigma_3\Sigma_1^{-1}+\Sigma_3^{-1}\Sigma_1\Sigma_1^{-1} \\ &= \Sigma_1^{-1}-\Sigma_1^{-1}+\Sigma_3^{-1} \\ &= \Sigma_3^{-1} \\ \\ \Sigma_2^{-1}-\Sigma_2^{-1}\Sigma_1\Sigma_3^{-1} &= \Sigma_2^{-1}-\Sigma_2^{-1}(\Sigma_3-\Sigma_2)\Sigma_3^{-1} \\ &= \Sigma_2^{-1}-\Sigma_2^{-1}\Sigma_3\Sigma_3^{-1}+\Sigma_2^{-1}\Sigma_2\Sigma_3^{-1} \\ &= \Sigma_2^{-1}-\Sigma_2^{-1}+\Sigma_3^{-1} \\ &= \Sigma_3^{-1} \\ \end{aligned} \]

Using the identities above, we obtain

\[ \begin{aligned} \Sigma_2^{-1}\Sigma_1\Sigma_3^{-1}\Sigma_2\Sigma_1^{-1} &= (\Sigma_2^{-1} - \Sigma_3^{-1})\Sigma_2\Sigma_1^{-1} \\ &= \Sigma_1^{-1} - \Sigma_3^{-1}\Sigma_2\Sigma_1^{-1} \\ &= \Sigma_3^{-1} \\ \end{aligned} \]

The final result is

\[ \begin{aligned} &\left(p_\mathcal{N}(\cdot | 0, \Sigma_1) \ast p_\mathcal{N}(\cdot | z, \Sigma_2)\right)(x) \\ = &\frac{1}{(2\pi)^{N/2}|\Sigma_3|^{1/2}} \exp \left(-\frac{1}{2}\left( x^\top(\Sigma_1^{-1}-\Sigma_3^{-1}\Sigma_2\Sigma_1^{-1})x - x^\top\Sigma_3^{-1}z - z^\top\Sigma_2^{-1}\Sigma_1\Sigma_3^{-1}\Sigma_2\Sigma_1^{-1}x + z^\top(\Sigma_2^{-1}-\Sigma_2^{-1}\Sigma_1\Sigma_3^{-1})z \right)\right) \\ = &\frac{1}{(2\pi)^{N/2}|\Sigma_3|^{1/2}} \exp \left(-\frac{1}{2}\left( x^\top\Sigma_3^{-1}x - x^\top\Sigma_3^{-1}z - z^\top\Sigma_3^{-1}x + z^\top\Sigma_3^{-1}z \right)\right) \\ = &\frac{1}{(2\pi)^{N/2}|\Sigma_3|^{1/2}} \exp \left(-\frac{1}{2}\left( (x-z)^\top\Sigma_3^{-1}(x-z) \right)\right) \\ = &p_\mathcal{N}(x | z, \Sigma_3) \\ = &p_\mathcal{N}(x | z, \Sigma_1 + \Sigma_2) \\ \end{aligned} \]

Let \(x_1\sim \mathcal{N}(\mu_1, \Sigma_1)\) and \(x_2\sim \mathcal{N}(\mu_2, \Sigma_2)\) be independent, and again let \(x = x_1+x_2\). Its density, written as a convolution, is

\[ \begin{aligned} p(x) &= \int p_\mathcal{N}(x_1 | \mu_1, \Sigma_1) p_\mathcal{N}(x-x_1 | \mu_2, \Sigma_2) dx_1 \\ &= \int p_\mathcal{N}(x - (x-x_1+\mu_1) | 0, \Sigma_1) p_\mathcal{N}(x-x_1+\mu_1 | \mu_1+\mu_2, \Sigma_2) dx_1 \\ &= \int p_\mathcal{N}(x - y | 0, \Sigma_1) p_\mathcal{N}(y | \mu_1+\mu_2, \Sigma_2) dy \\ &= p_\mathcal{N}(x | \mu_1 + \mu_2, \Sigma_1 + \Sigma_2) \end{aligned} \]

The result matches the one-dimensional case:

\[ x_1 + x_2 \sim \mathcal{N}(\mu_1 + \mu_2, \Sigma_1 + \Sigma_2) \]

Closure under multiplication

In general, the product of two normal densities is a constant multiple of a normal density.

\[ \begin{aligned} p_\mathcal{N}(x|\mu_1, \Sigma_1)p_\mathcal{N}(x|\mu_2, \Sigma_2) &\propto \exp\left( -\frac{1}{2}\left( (x-\mu_1)^T\Sigma_1^{-1}(x-\mu_1) + (x-\mu_2)^T\Sigma_2^{-1}(x-\mu_2)\right)\right) \\ &\propto \exp\left(-\frac{1}{2}\left(x^T(\Sigma_1^{-1}+\Sigma_2^{-1})x - x^T(\Sigma_1^{-1}\mu_1 + \Sigma_2^{-1}\mu_2) - (\Sigma_1^{-1}\mu_1+\Sigma_2^{-1}\mu_2)^Tx \right) \right) \\ &\propto \exp\left( -\frac{1}{2}\left( (x-(\Sigma_1^{-1}+\Sigma_2^{-1})^{-1}(\Sigma_1^{-1}\mu_1+\Sigma_2^{-1}\mu_2))^T(\Sigma_1^{-1}+\Sigma_2^{-1})(x-(\Sigma_1^{-1}+\Sigma_2^{-1})^{-1}(\Sigma_1^{-1}\mu_1+\Sigma_2^{-1}\mu_2)) \right) \right) \\ &\propto p_\mathcal{N}\left(x\middle|\Sigma_2(\Sigma_1+\Sigma_2)^{-1}\mu_1+\Sigma_1(\Sigma_1+\Sigma_2)^{-1}\mu_2,\ \Sigma_1(\Sigma_1+\Sigma_2)^{-1}\Sigma_2\right) \end{aligned} \]

The last step uses \((\Sigma_1^{-1}+\Sigma_2^{-1})^{-1} = \Sigma_2(\Sigma_1+\Sigma_2)^{-1}\Sigma_1 = \Sigma_1(\Sigma_1+\Sigma_2)^{-1}\Sigma_2\), where the second equality holds because the matrix is symmetric. Since covariance matrices need not commute, the order of the factors matters.
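In one dimension the covariances are scalars and commute, so the product is proportional to a normal density with mean \((\sigma_2^2\mu_1 + \sigma_1^2\mu_2)/(\sigma_1^2+\sigma_2^2)\) and variance \(\sigma_1^2\sigma_2^2/(\sigma_1^2+\sigma_2^2)\). The sketch below checks this by confirming that the ratio of the product to that density is the same at several points. The parameter values and test points are illustrative assumptions.

```python
import math

def normal_pdf(x: float, mean: float, var: float) -> float:
    """Density of the one-dimensional normal distribution N(mean, var)."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

mu1, var1 = 0.0, 1.0
mu2, var2 = 2.0, 4.0

# Scalar form of the mean and variance of the normalized product.
mean = (var2 * mu1 + var1 * mu2) / (var1 + var2)
var = var1 * var2 / (var1 + var2)

# If the product is a constant multiple of N(mean, var), this ratio does not depend on x.
ratios = [
    normal_pdf(x, mu1, var1) * normal_pdf(x, mu2, var2) / normal_pdf(x, mean, var)
    for x in (-1.0, 0.0, 0.5, 2.0)
]
print(ratios)
```

The constant itself equals \(p_\mathcal{N}(\mu_1 | \mu_2, \sigma_1^2 + \sigma_2^2)\), which is the integral of the product over \(x\).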

4. Jensen's Inequality

We first define convex functions.

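Let \(\mathbf{I}\) be an interval in \(\mathbb{R}\), and let \(f: \mathbf{I}\rightarrow \mathbb{R}\) be a real-valued function defined on it. The following five properties are equivalent, and each of them characterizes a convex function: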

  1. For all \(x, y \in \mathbf{I}\) and all \(t \in [0, 1]\), we have
\[ f(tx + (1-t)y)\le tf(x) + (1-t)f(y) \]

Figure: graph of a convex function.

  2. For all \(x, y, z \in \mathbf{I}\) with \(x<y<z\),
\[ \frac{f(y)-f(x)}{y-x}\le\frac{f(z)-f(x)}{z-x}\le\frac{f(z)-f(y)}{z-y} \]

  3. For every \(a \in \mathbf{I}\), the function
    \(\mathbf{I}\setminus\{a\}\rightarrow \mathbb{R}, x \mapsto \frac{f(x)-f(a)}{x-a}\)
    is non-decreasing in \(x\).

  4. For all \(x, y, z \in \mathbf{I}\) with \(x<y<z\),
\[ \det \begin{bmatrix}1&1&1 \\ x&y&z \\ f(x)&f(y)&f(z) \end{bmatrix} \ge 0 \]

  5. The set \(\mathcal{F}_{\ge f} = \{(x, y)|x \in \mathbf{I}, y \ge f(x)\} \subset \mathbb{R}^2\) is convex.

With convex functions defined, we can state Jensen's inequality.

Suppose \(f: \mathbf{I}\rightarrow \mathbb{R}\) is convex. Then for all \(x_1, \cdots, x_n \in \mathbf{I}\) and all \(t_1, \cdots, t_n \in [0, 1]\) with \(\sum_{i=1}^n t_i = 1\), we have

\[ f(t_1x_1+\cdots+t_nx_n) \le t_1f(x_1)+\cdots+t_nf(x_n) \]

If we regard \(t_i\) as the probability that a random variable \(x\) takes the value \(x_i\), then \(t_1x_1+\cdots+t_nx_n=E[x]\) and \(t_1f(x_1)+\cdots+t_nf(x_n)=E[f(x)]\), so the inequality above becomes

\[ f(E[x]) \le E[f(x)] \]

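Using the characterizations above, it is easy to check that \(-\log\) is convex, which means \(\log\) is concave. For a positive random variable \(x\) this gives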

\[ \log(E[x]) \ge E[\log(x)] \]
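A small numerical instance makes the inequality for \(\log\) concrete. The outcomes and probabilities below are made-up values chosen for illustration; any positive outcomes with weights summing to 1 would do.

```python
import math

values = [1.0, 2.0, 4.0, 8.0]    # illustrative positive outcomes x_i
weights = [0.1, 0.2, 0.3, 0.4]   # probabilities t_i, summing to 1

expectation = sum(t * x for t, x in zip(weights, values))
log_of_expectation = math.log(expectation)
expectation_of_log = sum(t * math.log(x) for t, x in zip(weights, values))

# Jensen's inequality for the concave log: log(E[x]) >= E[log(x)].
print(log_of_expectation, expectation_of_log)
```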

5. KL Divergence

Definition

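An event \(x\) that everyone already knows will happen (\(p(x)=1\)) carries zero information: its occurrence tells us nothing new. Conversely, when an event with very small probability occurs, it carries a lot of information, because it tells us there is something new we need to take into account. In other words, the information content of an event decreases as its probability increases, and we write it as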

\[ I_p(x) = -\log p(x) \]

The difference in information between \(q(x)\) and \(p(x)\) is

\[ \Delta I = I_p - I_q = \log \frac{q(x)}{p(x)} \]

The KL divergence is the expectation of this difference under \(q\):

\[ \begin{aligned} KL(q(x)||p(x)) &= E_{\sim q}(\Delta I) \\ &= \int (\Delta I)q(x) dx \\ &= \int q(x)\log \frac{q(x)}{p(x)} dx \end{aligned} \]

Since \(\log t \le t-1\) for all \(t > 0\), we have

\[ \begin{aligned} KL(q(x)|| p(x)) &= \int q(x)\log \frac{q(x)}{p(x)} dx \\ &= -\int q(x)\log (\frac{q(x)}{p(x)})^{-1} dx \\ &= -\int q(x)\log \frac{p(x)}{q(x)} dx \\ &\ge -\int q(x) (\frac{p(x)}{q(x)} - 1)dx \\ &= -\int (p(x) - q(x))dx \\ &= -(\int p(x) dx - \int q(x)dx) \\ &= 0 \end{aligned} \]

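That is, the KL divergence is always greater than or equal to 0. The bound \(\log t \le t - 1\) used above holds for all \(t > 0\) because \(\log\) is concave and \(t - 1\) is its tangent line at \(t = 1\).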

When the KL divergence is used, \(q(x)\) usually denotes the true data distribution, and \(p(x)\) denotes an approximation of it obtained by some method.

With discrete samples there are usually events we have not observed, so the supports of \(p(x)\) and \(q(x)\) do not match exactly. For example,

\[ \begin{aligned} P: (a: 3/5, b: 1/5, c: 1/5) \\ Q: (a: 5/9, b: 3/9, d: 1/9) \end{aligned} \]

To compute \(KL(P\|Q)\), we introduce a small constant \(\epsilon = 10^{-3}\) and modify \(P\) and \(Q\) as follows:

\[ \begin{aligned} P: (a: 3/5-\epsilon/3, b: 1/5-\epsilon/3, c: 1/5-\epsilon/3, d: \epsilon) \\ Q: (a: 5/9-\epsilon/3, b: 3/9-\epsilon/3, c: \epsilon, d: 1/9-\epsilon/3) \end{aligned} \]

Now the KL divergence can be computed.
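The smoothed example can be evaluated directly with the discrete form of the definition, \(\sum_x p(x)\log\frac{p(x)}{q(x)}\). The sketch below uses the smoothed \(P\) and \(Q\) from above; the helper name `kl_divergence` and the dictionary representation are assumptions for illustration.

```python
import math

def kl_divergence(p: dict, q: dict) -> float:
    """KL(p || q) for discrete distributions over the same keys, all probabilities > 0."""
    return sum(p[k] * math.log(p[k] / q[k]) for k in p)

eps = 1e-3
# Smoothed distributions: eps is moved onto the event each one never observed.
p = {"a": 3 / 5 - eps / 3, "b": 1 / 5 - eps / 3, "c": 1 / 5 - eps / 3, "d": eps}
q = {"a": 5 / 9 - eps / 3, "b": 3 / 9 - eps / 3, "c": eps, "d": 1 / 9 - eps / 3}

print(kl_divergence(p, q), kl_divergence(q, p))
```

The two printed values differ because the KL divergence is not symmetric. Both also depend on the choice of \(\epsilon\), since the terms involving the unobserved events grow as \(\epsilon\) shrinks.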

KL divergence between normal distributions

Suppose we have two \(N\)-dimensional normal distributions

\[ \begin{aligned} p(x) &= p_\mathcal{N}(x|\mu_p, \Sigma_p) = \frac{1}{(2\pi)^{N/2}|\Sigma_p|^{1/2}}\exp\left( -\frac{1}{2}(x-\mu_p)^T\Sigma_p^{-1}(x-\mu_p) \right) \\ q(x) &= p_\mathcal{N}(x|\mu_q, \Sigma_q) = \frac{1}{(2\pi)^{N/2}|\Sigma_q|^{1/2}}\exp\left( -\frac{1}{2}(x-\mu_q)^T\Sigma_q^{-1}(x-\mu_q) \right) \end{aligned} \]

The KL divergence between them is

\[ \begin{aligned} D_{KL}\left(p\|q\right) &= \int p(x) \log \frac{p(x)}{q(x)} dx \\ &= \int p(x) \left(\log p(x) - \log q(x)\right) dx \\ &= \int p(x) \frac{1}{2} \left(-N\log\left(2\pi\right) - \log |\Sigma_p| - (x-\mu_p)^\top\Sigma_p^{-1}(x-\mu_p) + N\log\left(2\pi\right) + \log |\Sigma_q| + (x-\mu_q)^\top\Sigma_q^{-1}(x-\mu_q) \right) dx \\ &= \int p(x) \frac{1}{2} \left(\log \frac{|\Sigma_q|}{|\Sigma_p|} - (x-\mu_p)^\top\Sigma_p^{-1}(x-\mu_p) + (x-\mu_q)^\top\Sigma_q^{-1}(x-\mu_q) \right) dx \\ &= \frac{1}{2}\left( \log \frac{|\Sigma_q|}{|\Sigma_p|} - \int p(x)(x-\mu_p)^\top\Sigma_p^{-1}(x-\mu_p)dx + \int p(x)(x-\mu_q)^\top\Sigma_q^{-1}(x-\mu_q)dx \right) \\ \end{aligned} \]

The trace of a matrix product is invariant under swapping the two factors, that is,

\[ tr(AB) = tr(BA) \]

\(x^TAx\) is a real number, which can also be viewed as a \(1\times 1\) matrix, so

\[ \begin{aligned} x^TAx &= tr(x^TAx) \\ &= tr(Axx^T) \end{aligned} \]

Therefore we obtain

\[ \begin{aligned} &\int p(x)(x-\mu_p)^\top\Sigma_p^{-1}(x-\mu_p)dx \\ =& \int p(x)tr\left(\Sigma_p^{-1}(x-\mu_p)(x-\mu_p)^\top\right)dx \\ =& tr\left(\Sigma_p^{-1}\int p(x)(x-\mu_p)(x-\mu_p)^\top dx\right) \\ =& tr\left(\Sigma_p^{-1}\Sigma_p\right) \\ =& tr\left(I\right) \\ =& N \\ \end{aligned} \]
\[ \begin{aligned} &\int p(x)(x-\mu_q)^\top\Sigma_q^{-1}(x-\mu_q)dx \\ =& \int p(x)tr\left(\Sigma_q^{-1}(x-\mu_q)(x-\mu_q)^\top\right)dx \\ =& tr\left(\Sigma_q^{-1}\int p(x)(x-\mu_q)(x-\mu_q)^\top dx\right) \\ =& tr\left(\Sigma_q^{-1}\int p(x)(x-\mu_p+\mu_p-\mu_q)(x-\mu_p+\mu_p-\mu_q)^\top dx\right) \\ =& tr\left(\Sigma_q^{-1}\int p(x)\left((x-\mu_p)(x-\mu_p)^\top+(x-\mu_p)(\mu_p-\mu_q)^\top+(\mu_p-\mu_q)(x-\mu_p)^\top+(\mu_p-\mu_q)(\mu_p-\mu_q)^\top\right) dx\right) \\ =& tr\left(\Sigma_q^{-1}\left(\Sigma_p + 0 + 0 + (\mu_p-\mu_q)(\mu_p-\mu_q)^\top \right) \right) \\ =& tr\left(\Sigma_q^{-1}\Sigma_p\right) + tr\left(\Sigma_q^{-1}(\mu_p-\mu_q)(\mu_p-\mu_q)^\top\right) \\ =& tr\left(\Sigma_q^{-1}\Sigma_p\right) + (\mu_p-\mu_q)^\top\Sigma_q^{-1}(\mu_p-\mu_q) \\ \end{aligned} \]

Therefore, the KL divergence between two normal distributions is:

\[ \begin{aligned} D_{KL}\left(p\|q\right) &= \frac{1}{2}\left( \log \frac{|\Sigma_q|}{|\Sigma_p|} - \int p(x)(x-\mu_p)^\top\Sigma_p^{-1}(x-\mu_p)dx + \int p(x)(x-\mu_q)^\top\Sigma_q^{-1}(x-\mu_q)dx \right) \\ &= \frac{1}{2}\left( \log \frac{|\Sigma_q|}{|\Sigma_p|} - N + tr\left(\Sigma_q^{-1}\Sigma_p\right) + (\mu_p-\mu_q)^\top\Sigma_q^{-1}(\mu_p-\mu_q) \right) \\ \end{aligned} \]

To summarize, for two normal distributions

\[ \begin{aligned} p(x) &= p_\mathcal{N}(x|\mu_p, \Sigma_p) = \frac{1}{(2\pi)^{N/2}|\Sigma_p|^{1/2}}\exp\left( -\frac{1}{2}(x-\mu_p)^T\Sigma_p^{-1}(x-\mu_p) \right) \\ q(x) &= p_\mathcal{N}(x|\mu_q, \Sigma_q) = \frac{1}{(2\pi)^{N/2}|\Sigma_q|^{1/2}}\exp\left( -\frac{1}{2}(x-\mu_q)^T\Sigma_q^{-1}(x-\mu_q) \right) \end{aligned} \]

the KL divergence is

\[ D_{KL}\left(p\|q\right) = \frac{1}{2}\left( \log \frac{|\Sigma_q|}{|\Sigma_p|} - N + tr\left(\Sigma_q^{-1}\Sigma_p\right) + (\mu_p-\mu_q)^\top\Sigma_q^{-1}(\mu_p-\mu_q) \right) \]
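For \(N = 1\) the determinants and the trace reduce to scalars, and the formula becomes \(\frac{1}{2}\left(\log\frac{\sigma_q^2}{\sigma_p^2} - 1 + \frac{\sigma_p^2}{\sigma_q^2} + \frac{(\mu_p-\mu_q)^2}{\sigma_q^2}\right)\). The sketch below compares this scalar special case with a direct numerical integration of \(\int p(x)\log\frac{p(x)}{q(x)}dx\). The parameter values, integration bounds, and function names are illustrative assumptions.

```python
import math

def normal_log_pdf(x: float, mean: float, var: float) -> float:
    """Log-density of N(mean, var); working in logs avoids underflow in the tails."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def kl_normal_1d(mu_p, var_p, mu_q, var_q):
    """Closed form for N = 1: determinants and the trace are plain scalars."""
    return 0.5 * (math.log(var_q / var_p) - 1 + var_p / var_q + (mu_p - mu_q) ** 2 / var_q)

def kl_numeric(mu_p, var_p, mu_q, var_q, lo=-30.0, hi=30.0, steps=60000):
    """Midpoint-rule approximation of the integral of p(x) * (log p(x) - log q(x))."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        log_p = normal_log_pdf(x, mu_p, var_p)
        log_q = normal_log_pdf(x, mu_q, var_q)
        total += math.exp(log_p) * (log_p - log_q)
    return total * h

closed = kl_normal_1d(0.0, 1.0, 1.0, 4.0)
numeric = kl_numeric(0.0, 1.0, 1.0, 4.0)
print(closed, numeric)
```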