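1. Maximum Likelihood
Suppose we observe random variables \(X = \{X_1, X_2, \cdots, X_n\}\) whose distribution depends on an unknown parameter \(\theta\). We want a good estimate of \(\theta\).
The working assumption is that a value \(\hat{\theta}\) that maximizes the probability of the data we actually observed is a natural candidate for the unknown parameter. It need not equal the true value, but it is usually a good approximation.
Assuming the observations are independent, the observed values define a likelihood function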
\[
\begin{aligned}
L(\theta) &= P(X_1=x_1, X_2=x_2, \cdots, X_n = x_n) \\
&= p(x_1|\theta)\cdot p(x_2|\theta)\cdots p(x_n|\theta) \\
&= \prod_{i=1}^n p(x_i|\theta)
\end{aligned}
\]
Taking the logarithm turns the product into a sum and gives the log-likelihood function
\[
\begin{aligned}
l(\theta) &= \log L(\theta) \\
&= \log \prod_{i=1}^n p(x_i|\theta) \\
&= \sum_{i=1}^n \log p(x_i|\theta)
\end{aligned}
\]
The estimate of the unknown \(\theta\) is then the value that maximizes \(l(\theta)\).
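For example, suppose \(X\) takes the values 0, 1, 2, 3 with probabilities that depend on an unknown parameter \(\theta\):
| \(X\) | 0 | 1 | 2 | 3 |
| --- | --- | --- | --- | --- |
| \(p(X)\) | \(\frac{3\theta}{4}\) | \(\frac{\theta}{4}\) | \(\frac{2(1-\theta)}{3}\) | \(\frac{1-\theta}{3}\) |
If each of the four values is observed exactly once, the likelihood function is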
\[
\begin{aligned}
L(\theta) &= \frac{3\theta}{4} \cdot \frac{\theta}{4} \cdot \frac{2(1-\theta)}{3} \cdot \frac{1-\theta}{3} \\
& = \frac{1}{24}\theta^2(1-\theta)^2
\end{aligned}
\]
The log-likelihood function is
\[
\begin{aligned}
l(\theta) &= \log L(\theta) \\
&= \log \frac{1}{24} + 2 \log \theta + 2 \log (1-\theta)
\end{aligned}
\]
Differentiating with respect to \(\theta\) gives
\[
\frac{d l(\theta)}{d \theta} = \frac{2}{\theta} - \frac{2}{1-\theta}
\]
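To maximize \(l(\theta)\), set the derivative to zero, which gives \(\theta = \frac{1}{2}\). The second derivative \(-\frac{2}{\theta^2} - \frac{2}{(1-\theta)^2}\) is negative on \((0, 1)\), so this is a maximum.
In this way, maximum likelihood gives us an estimate of the unknown parameter.
The intuition behind maximum likelihood is that if a model can generate data that closely resembles the training set, then the samples it generates that are not in the training set should also resemble the training data.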
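The worked example can be checked numerically. The following is a minimal sketch, assuming each of the values 0, 1, 2, 3 is observed exactly once; the function name `log_likelihood` and the grid resolution are illustrative choices, not part of the original derivation.

```python
import numpy as np

def log_likelihood(theta):
    """Log-likelihood when each of X = 0, 1, 2, 3 is observed exactly once."""
    probs = np.array([
        3 * theta / 4,
        theta / 4,
        2 * (1 - theta) / 3,
        (1 - theta) / 3,
    ])
    # Sum the log-probabilities of the four observations.
    return np.sum(np.log(probs), axis=0)

# Evaluate on a grid strictly inside (0, 1), where every probability is positive.
grid = np.linspace(0.001, 0.999, 999)
theta_hat = grid[np.argmax(log_likelihood(grid))]
print(theta_hat)
```

A grid search is used here only because it is easy to read; for this example the closed-form solution from the derivative is exact.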
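2. Sums of Independent Random Variables and Convolution
Basic result
The distribution of the sum of two independent random variables is the convolution of their individual distributions: of their PMFs in the discrete case and of their PDFs in the continuous case.
2.1 Discrete case
Let \(X\) and \(Y\) be two independent discrete random variables with probability mass functions (PMFs) \(p_X(k)\) and \(p_Y(k)\).
Let \(Z = X + Y\). Then the PMF of \(Z\) is: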
\[
p_Z(n) = \sum_{k=-\infty}^{\infty} p_X(k) \cdot p_Y(n - k)
\]
This is the definition of discrete convolution.
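As an illustration of the discrete formula, the sketch below computes the PMF of the sum of two fair six-sided dice. The dice example is an assumption added for illustration. Index 0 of `die` corresponds to the face value 1, so index 0 of the convolved array corresponds to the sum 2.

```python
import numpy as np

# PMF of a fair six-sided die on the support {1, ..., 6}.
die = np.full(6, 1 / 6)

# np.convolve computes sum_k a[k] * b[n - k]; the result covers the sums 2..12.
pmf_sum = np.convolve(die, die)
support = np.arange(2, 13)

for value, prob in zip(support, pmf_sum):
    print(value, round(prob, 4))
```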
2.2 Continuous case
Let \(X\) and \(Y\) be two independent continuous random variables with probability density functions (PDFs) \(f_X(x)\) and \(f_Y(y)\).
Let \(Z = X + Y\). Then the PDF of \(Z\) is:
\[
f_Z(z) = \int_{-\infty}^{\infty} f_X(x) \cdot f_Y(z - x)\,dx
\]
This is the definition of continuous convolution.
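The continuous formula can be approximated on a grid by a Riemann sum. The sketch below assumes two independent Uniform(0, 1) variables, whose sum has the triangular density on \([0, 2]\); the example and the grid size `n` are illustrative assumptions.

```python
import numpy as np

n = 1000
dx = 1.0 / n
x = np.arange(n) * dx              # grid on [0, 1)
f = np.ones_like(x)                # Uniform(0, 1) density sampled on the grid

# Riemann-sum approximation of the convolution integral.
f_z = np.convolve(f, f) * dx
z = np.arange(len(f_z)) * dx       # grid for Z = X + Y on [0, 2)

def triangular(z):
    """Exact density of the sum of two independent Uniform(0, 1) variables."""
    return np.where(z <= 1, z, 2 - z)

max_err = np.max(np.abs(f_z - triangular(z)))
print(max_err)
```

The discretization error is on the order of the grid spacing `dx`, so a finer grid gives a closer match.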
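3. The Normal Distribution
Convolution of normal distributions
When two independent random variables follow different normal distributions, we want the distribution of their sum.
The probability density function of a one-dimensional normal distribution is: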
\[
p(x)= \mathcal{N}(x | \mu,\sigma^2)= \frac{1}{\sqrt{2\pi\sigma^2}} \exp \left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
\]
Here \(\mu\in \mathbb{R}\) is the mean of the distribution and \(\sigma > 0\) is the standard deviation:
\[
\begin{aligned}
\mu &= \mathbb{E}[x] = \int p(x)xdx \\
\sigma^2 &= \mathbb{E}[(x-\mu)^2] = \int p(x)(x-\mu)^2 dx
\end{aligned}
\]
Now consider the convolution of two normal densities, \(\mathcal{N}(\cdot | 0, \sigma_1^2)\) and \(\mathcal{N}(\cdot|z, \sigma_2^2)\):
\[
\begin{aligned}
&\left(p_\mathcal{N}(\cdot | 0, \sigma_1^2) \ast p_\mathcal{N}(\cdot | z, \sigma_2^2)\right)(x) \\
= &\int p_\mathcal{N}(x-y | 0, \sigma_1^2) p_\mathcal{N}(y | z, \sigma_2^2) dy \\
= &\int \frac{1}{\sqrt{2\pi\sigma_1^2}} \exp \left(-\frac{(x-y)^2}{2\sigma_1^2}\right) \frac{1}{\sqrt{2\pi\sigma_2^2}} \exp \left(-\frac{(y-z)^2}{2\sigma_2^2}\right) dy \\
= &\frac{1}{2\pi\sqrt{\sigma_1^2 \sigma_2^2}} \int \exp \left( -\frac{(x-y)^2}{2\sigma_1^2}-\frac{(y-z)^2}{2\sigma_2^2} \right) dy \\
= &\frac{1}{2\pi\sqrt{\sigma_1^2 \sigma_2^2}} \int \exp \left( -\frac{1}{2\sigma_1^2 \sigma_2^2} \left( \sigma_2^2 (y-x)^2 + \sigma_1^2 (y-z)^2 \right) \right) dy \\
= &\frac{1}{2\pi\sqrt{\sigma_1^2 \sigma_2^2}} \int \exp \left( -\frac{1}{2\sigma_1^2 \sigma_2^2} \left( \sigma_2^2 (y^2 - 2xy + x^2) + \sigma_1^2 (y^2 - 2zy + z^2) \right) \right) dy \\
= &\frac{1}{2\pi\sqrt{\sigma_1^2 \sigma_2^2}} \int \exp \left( -\frac{1}{2\sigma_1^2 \sigma_2^2} \left( (\sigma_1^2 + \sigma_2^2)y^2 - 2(\sigma_2^2 x + \sigma_1^2 z)y + \sigma_2^2 x^2 + \sigma_1^2 z^2 \right) \right) dy \\
= &\frac{1}{2\pi\sqrt{\sigma_1^2 \sigma_2^2}} \int \exp \left( -\frac{1}{2\sigma_1^2 \sigma_2^2} \left( (\sigma_1^2 + \sigma_2^2) \left(y - \frac{\sigma_2^2 x + \sigma_1^2 z}{\sigma_1^2 + \sigma_2^2}\right)^2 - \frac{\left( \sigma_2^2 x + \sigma_1^2 z \right)^2}{\sigma_1^2 + \sigma_2^2} + \sigma_2^2 x^2 + \sigma_1^2 z^2 \right) \right) dy \\
= &\frac{1}{2\pi\sqrt{\sigma_1^2 \sigma_2^2}} \exp \left( -\frac{1}{2\sigma_1^2 \sigma_2^2} \left( - \frac{\left( \sigma_2^2 x + \sigma_1^2 z \right)^2}{\sigma_1^2 + \sigma_2^2} + \sigma_2^2 x^2 + \sigma_1^2 z^2 \right) \right) \int \exp \left( -\frac{\sigma_1^2 + \sigma_2^2}{2\sigma_1^2 \sigma_2^2} \left(y - \frac{\sigma_2^2 x + \sigma_1^2 z}{\sigma_1^2 + \sigma_2^2}\right)^2 \right) dy \\
= &\frac{1}{2\pi\sqrt{\sigma_1^2 \sigma_2^2}} \exp \left( -\frac{1}{2\sigma_1^2 \sigma_2^2} \left( - \frac{\left( \sigma_2^2 x + \sigma_1^2 z \right)^2}{\sigma_1^2 + \sigma_2^2} + \sigma_2^2 x^2 + \sigma_1^2 z^2 \right) \right) \sqrt{2\pi \frac{\sigma_1^2 \sigma_2^2}{\sigma_1^2 + \sigma_2^2}} \\
= &\frac{1}{\sqrt{2\pi(\sigma_1^2 + \sigma_2^2)}} \exp \left( -\frac{1}{2\sigma_1^2 \sigma_2^2} \left( - \frac{\left( \sigma_2^2 x + \sigma_1^2 z \right)^2}{\sigma_1^2 + \sigma_2^2} + \sigma_2^2 x^2 + \sigma_1^2 z^2 \right) \right) \\
= &\frac{1}{\sqrt{2\pi(\sigma_1^2 + \sigma_2^2)}} \exp \left( -\frac{1}{2\sigma_1^2 \sigma_2^2(\sigma_1^2 + \sigma_2^2)} \left( - \left( \sigma_2^2 x + \sigma_1^2 z \right)^2 + (\sigma_1^2 + \sigma_2^2)\sigma_2^2 x^2 + (\sigma_1^2 + \sigma_2^2)\sigma_1^2 z^2 \right) \right) \\
= &\frac{1}{\sqrt{2\pi(\sigma_1^2 + \sigma_2^2)}} \exp \left( -\frac{1}{2\sigma_1^2 \sigma_2^2(\sigma_1^2 + \sigma_2^2)} \left( -2 \sigma_1^2\sigma_2^2 xz + \sigma_1^2\sigma_2^2 x^2 + \sigma_1^2\sigma_2^2 z^2 \right) \right) \\
= &\frac{1}{\sqrt{2\pi(\sigma_1^2 + \sigma_2^2)}} \exp \left( -\frac{1}{2(\sigma_1^2 + \sigma_2^2)} \left( -2 xz + x^2 + z^2 \right) \right) \\
= &\frac{1}{\sqrt{2\pi(\sigma_1^2 + \sigma_2^2)}} \exp \left( -\frac{(x-z)^2}{2(\sigma_1^2 + \sigma_2^2)} \right) \\
= &p_\mathcal{N}(x | z, \sigma_1^2 + \sigma_2^2)
\end{aligned}
\]
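Additivity
Next, let \(x_1 \sim \mathcal{N}(\mu_1, \sigma_1^2)\) and \(x_2 \sim \mathcal{N}(\mu_2, \sigma_2^2)\) be independent. If \(x = x_1 + x_2\), its density is given by the convolution below, where the third line uses the substitution \(y = x - x_1 + \mu_1\):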
\[
\begin{aligned}
p(x)
&= \int p_\mathcal{N}(x_1|\mu_1, \sigma_1^2)p_\mathcal{N}(x-x_1|\mu_2, \sigma_2^2) dx_1 \\
&= \int p_\mathcal{N}(x-(x - x_1+\mu_1)|0, \sigma_1^2)p_\mathcal{N}(x-x_1+\mu_1|\mu_1+\mu_2, \sigma_2^2) dx_1 \\
&= \int p_\mathcal{N}(x - y|0, \sigma_1^2)p_\mathcal{N}(y|\mu_1+\mu_2, \sigma_2^2) d y \\
&= p_\mathcal{N}(x|\mu_1+\mu_2, \sigma_1^2+\sigma_2^2)
\end{aligned}
\]
That is, the sum of two independent normally distributed random variables is again normally distributed:
\[
x_1+x_2 \sim \mathcal{N}(\mu_1+\mu_2, \sigma_1^2+\sigma_2^2)
\]
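A quick Monte Carlo check of this result, as a minimal sketch: the parameter values, sample size, and seed below are arbitrary assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

mu1, sigma1 = 1.0, 2.0
mu2, sigma2 = -3.0, 0.5
n = 200_000

# Independent draws; rng.normal takes the standard deviation, not the variance.
x1 = rng.normal(mu1, sigma1, size=n)
x2 = rng.normal(mu2, sigma2, size=n)
x = x1 + x2

print(x.mean(), mu1 + mu2)                  # sample mean vs. mu1 + mu2
print(x.var(), sigma1**2 + sigma2**2)       # sample variance vs. sigma1^2 + sigma2^2
```

The sample statistics only agree with the theoretical values up to sampling noise, which shrinks as `n` grows.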
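The same conclusion holds in higher dimensions.
First compute \(p_\mathcal{N}(\cdot|0, \Sigma_1) \ast p_\mathcal{N}(\cdot|z, \Sigma_2)\), where \(\Sigma_1\) and \(\Sigma_2\) are symmetric positive-definite covariance matrices and \(N\) is the dimension: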
\[
\begin{aligned}
&\left(p_\mathcal{N}(\cdot | 0, \Sigma_1) \ast p_\mathcal{N}(\cdot | z, \Sigma_2)\right)(x) \\
= &\int p_\mathcal{N}(x-y | 0, \Sigma_1) p_\mathcal{N}(y | z, \Sigma_2) dy \\
= &\int \frac{1}{(2\pi)^{N/2}|\Sigma_1|^{1/2}} \exp \left(-\frac{1}{2}(x-y)^\top\Sigma_1^{-1}(x-y)\right) \frac{1}{(2\pi)^{N/2}|\Sigma_2|^{1/2}} \exp \left(-\frac{1}{2}(y-z)^\top\Sigma_2^{-1}(y-z)\right) dy \\
= &\frac{1}{(2\pi)^N|\Sigma_1 \Sigma_2|^{1/2}}\int \exp \left(-\frac{1}{2}\left( (x-y)^\top\Sigma_1^{-1}(x-y) + (y-z)^\top\Sigma_2^{-1}(y-z) \right)\right) dy \\
= &\frac{1}{(2\pi)^N|\Sigma_1 \Sigma_2|^{1/2}}\int \exp \left(-\frac{1}{2}\left( x^\top\Sigma_1^{-1} x - x^\top\Sigma_1^{-1} y - y^\top\Sigma_1^{-1} x + y^\top\Sigma_1^{-1} y + y^\top\Sigma_2^{-1} y - y^\top\Sigma_2^{-1} z - z^\top\Sigma_2^{-1} y + z^\top\Sigma_2^{-1} z \right)\right) dy \\
= &\frac{1}{(2\pi)^N|\Sigma_1 \Sigma_2|^{1/2}}\int \exp \left(-\frac{1}{2}\left( y^\top(\Sigma_1^{-1}+\Sigma_2^{-1})y - (x^\top\Sigma_1^{-1}+z^\top\Sigma_2^{-1})y - y^\top(\Sigma_1^{-1} x+\Sigma_2^{-1} z) + x^\top\Sigma_1^{-1} x + z^\top\Sigma_2^{-1} z \right)\right) dy \\
= &\frac{1}{(2\pi)^N|\Sigma_1 \Sigma_2|^{1/2}}\int \exp \left(-\frac{1}{2}\left( y^\top(\Sigma_1^{-1}+\Sigma_2^{-1})y - (\Sigma_1^{-1} x+\Sigma_2^{-1} z)^\top y - y^\top(\Sigma_1^{-1} x+\Sigma_2^{-1} z) + x^\top\Sigma_1^{-1} x + z^\top\Sigma_2^{-1} z \right)\right) dy \\
= &\frac{1}{(2\pi)^N|\Sigma_1 \Sigma_2|^{1/2}}\int \exp \left(-\frac{1}{2}\left( \left( y - (\Sigma_1^{-1}+\Sigma_2^{-1})^{-1}(\Sigma_1^{-1} x+\Sigma_2^{-1} z) \right)^\top(\Sigma_1^{-1}+\Sigma_2^{-1})\left( y - (\Sigma_1^{-1}+\Sigma_2^{-1})^{-1}(\Sigma_1^{-1} x+\Sigma_2^{-1} z) \right) \right. \right. \\
&\hspace{125pt} \left. \left. - \left( (\Sigma_1^{-1}+\Sigma_2^{-1})^{-1}(\Sigma_1^{-1} x+\Sigma_2^{-1} z) \right)^\top(\Sigma_1^{-1}+\Sigma_2^{-1})(\Sigma_1^{-1}+\Sigma_2^{-1})^{-1}(\Sigma_1^{-1} x+\Sigma_2^{-1} z) \right. \right. \\
&\hspace{125pt} \left. \left. + x^\top\Sigma_1^{-1} x + z^\top\Sigma_2^{-1} z \right)\right) dy \\
= &\frac{(2\pi)^{N/2}|(\Sigma_1^{-1}+\Sigma_2^{-1})^{-1}|^{1/2}}{(2\pi)^N|\Sigma_1 \Sigma_2|^{1/2}} \exp \left(-\frac{1}{2}\left( x^\top\Sigma_1^{-1} x + z^\top\Sigma_2^{-1} z - \left( (\Sigma_1^{-1}+\Sigma_2^{-1})^{-1}(\Sigma_1^{-1} x+\Sigma_2^{-1} z) \right)^\top(\Sigma_1^{-1} x+\Sigma_2^{-1} z) \right)\right) \\
\end{aligned}
\]
Let \(\Sigma_3 := \Sigma_1 + \Sigma_2\). Then
\[
\begin{aligned}
\Sigma_1 (\Sigma_1^{-1}+\Sigma_2^{-1}) \Sigma_2 &= I \Sigma_2 + \Sigma_1 I \\
&= \Sigma_3
\end{aligned}
\]
\[
\begin{aligned}
(\Sigma_1^{-1}+\Sigma_2^{-1})^{-1} &= (\Sigma_1^{-1} \Sigma_3 \Sigma_2^{-1})^{-1} \\
&= \Sigma_2 \Sigma_3^{-1} \Sigma_1
\end{aligned}
\]
Therefore,
\[
\begin{aligned}
&\left(p_\mathcal{N}(\cdot | 0, \Sigma_1) \ast p_\mathcal{N}(\cdot | z, \Sigma_2)\right)(x) \\
= &\frac{(2\pi)^{N/2}|(\Sigma_1^{-1}+\Sigma_2^{-1})^{-1}|^{1/2}}{(2\pi)^N|\Sigma_1 \Sigma_2|^{1/2}} \exp \left(-\frac{1}{2}\left( x^\top\Sigma_1^{-1} x + z^\top\Sigma_2^{-1} z - \left( (\Sigma_1^{-1}+\Sigma_2^{-1})^{-1}(\Sigma_1^{-1} x+\Sigma_2^{-1} z) \right)^\top(\Sigma_1^{-1} x+\Sigma_2^{-1} z) \right)\right) \\
= &\frac{(2\pi)^{N/2}|\Sigma_2 \Sigma_3^{-1} \Sigma_1|^{1/2}}{(2\pi)^N|\Sigma_1 \Sigma_2|^{1/2}} \exp \left(-\frac{1}{2}\left( x^\top\Sigma_1^{-1} x + z^\top\Sigma_2^{-1} z - \left( \Sigma_2 \Sigma_3^{-1} \Sigma_1(\Sigma_1^{-1} x+\Sigma_2^{-1} z) \right)^\top(\Sigma_1^{-1} x+\Sigma_2^{-1} z) \right)\right) \\
= &\frac{1}{(2\pi)^{N/2}|\Sigma_3|^{1/2}} \exp \left(-\frac{1}{2}\left( x^\top\Sigma_1^{-1} x + z^\top\Sigma_2^{-1} z - \left( \Sigma_1(\Sigma_1^{-1} x+\Sigma_2^{-1} z) \right)^\top\Sigma_3^{-1}\Sigma_2(\Sigma_1^{-1} x+\Sigma_2^{-1} z) \right)\right) \\
= &\frac{1}{(2\pi)^{N/2}|\Sigma_3|^{1/2}} \exp \left(-\frac{1}{2}\left( x^\top\Sigma_1^{-1} x + z^\top\Sigma_2^{-1} z - \left( x+\Sigma_1\Sigma_2^{-1} z\right)^\top\Sigma_3^{-1}(\Sigma_2\Sigma_1^{-1} x+z) \right)\right) \\
= &\frac{1}{(2\pi)^{N/2}|\Sigma_3|^{1/2}} \exp \left(-\frac{1}{2}\left( x^\top\Sigma_1^{-1} x + z^\top\Sigma_2^{-1} z - x^\top\Sigma_3^{-1}\Sigma_2\Sigma_1^{-1}x - x^\top\Sigma_3^{-1}z - z^\top\Sigma_2^{-1}\Sigma_1\Sigma_3^{-1}\Sigma_2\Sigma_1^{-1}x - z^\top\Sigma_2^{-1}\Sigma_1\Sigma_3^{-1}z \right)\right) \\
= &\frac{1}{(2\pi)^{N/2}|\Sigma_3|^{1/2}} \exp \left(-\frac{1}{2}\left( x^\top(\Sigma_1^{-1}-\Sigma_3^{-1}\Sigma_2\Sigma_1^{-1})x - x^\top\Sigma_3^{-1}z - z^\top\Sigma_2^{-1}\Sigma_1\Sigma_3^{-1}\Sigma_2\Sigma_1^{-1}x + z^\top(\Sigma_2^{-1}-\Sigma_2^{-1}\Sigma_1\Sigma_3^{-1})z \right)\right) \\
\end{aligned}
\]
where
\[
\begin{aligned}
\Sigma_1^{-1}-\Sigma_3^{-1}\Sigma_2\Sigma_1^{-1} &= \Sigma_1^{-1}-\Sigma_3^{-1}(\Sigma_3 - \Sigma_1)\Sigma_1^{-1} \\
&= \Sigma_1^{-1}-\Sigma_3^{-1}\Sigma_3\Sigma_1^{-1}+\Sigma_3^{-1}\Sigma_1\Sigma_1^{-1} \\
&= \Sigma_1^{-1}-\Sigma_1^{-1}+\Sigma_3^{-1} \\
&= \Sigma_3^{-1} \\
\\
\Sigma_2^{-1}-\Sigma_2^{-1}\Sigma_1\Sigma_3^{-1} &= \Sigma_2^{-1}-\Sigma_2^{-1}(\Sigma_3-\Sigma_2)\Sigma_3^{-1} \\
&= \Sigma_2^{-1}-\Sigma_2^{-1}\Sigma_3\Sigma_3^{-1}+\Sigma_2^{-1}\Sigma_2\Sigma_3^{-1} \\
&= \Sigma_2^{-1}-\Sigma_2^{-1}+\Sigma_3^{-1} \\
&= \Sigma_3^{-1} \\
\end{aligned}
\]
Using the identities above, we obtain
\[
\begin{aligned}
\Sigma_2^{-1}\Sigma_1\Sigma_3^{-1}\Sigma_2\Sigma_1^{-1} &= (\Sigma_2^{-1} - \Sigma_3^{-1})\Sigma_2\Sigma_1^{-1} \\
&= \Sigma_1^{-1} - \Sigma_3^{-1}\Sigma_2\Sigma_1^{-1} \\
&= \Sigma_3^{-1} \\
\end{aligned}
\]
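The matrix identities used in this derivation can be sanity-checked numerically. The sketch below draws two random symmetric positive-definite matrices; the helper `random_spd` is a hypothetical name introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_spd(n):
    """Random symmetric positive-definite matrix (illustrative helper)."""
    a = rng.standard_normal((n, n))
    return a @ a.T + n * np.eye(n)

s1, s2 = random_spd(3), random_spd(3)
s3 = s1 + s2
inv = np.linalg.inv

# (S1^{-1} + S2^{-1})^{-1} = S2 S3^{-1} S1
check_a = np.allclose(inv(inv(s1) + inv(s2)), s2 @ inv(s3) @ s1)
# S1^{-1} - S3^{-1} S2 S1^{-1} = S3^{-1}
check_b = np.allclose(inv(s1) - inv(s3) @ s2 @ inv(s1), inv(s3))
# S2^{-1} - S2^{-1} S1 S3^{-1} = S3^{-1}
check_c = np.allclose(inv(s2) - inv(s2) @ s1 @ inv(s3), inv(s3))
# S2^{-1} S1 S3^{-1} S2 S1^{-1} = S3^{-1}
check_d = np.allclose(inv(s2) @ s1 @ inv(s3) @ s2 @ inv(s1), inv(s3))

print(check_a, check_b, check_c, check_d)
```

A numerical check on one random pair is not a proof, but it catches sign and ordering mistakes, which are easy to make because these matrices do not commute in general.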
The final result is
\[
\begin{aligned}
&\left(p_\mathcal{N}(\cdot | 0, \Sigma_1) \ast p_\mathcal{N}(\cdot | z, \Sigma_2)\right)(x) \\
= &\frac{1}{(2\pi)^{N/2}|\Sigma_3|^{1/2}} \exp \left(-\frac{1}{2}\left( x^\top(\Sigma_1^{-1}-\Sigma_3^{-1}\Sigma_2\Sigma_1^{-1})x - x^\top\Sigma_3^{-1}z - z^\top\Sigma_2^{-1}\Sigma_1\Sigma_3^{-1}\Sigma_2\Sigma_1^{-1}x + z^\top(\Sigma_2^{-1}-\Sigma_2^{-1}\Sigma_1\Sigma_3^{-1})z \right)\right) \\
= &\frac{1}{(2\pi)^{N/2}|\Sigma_3|^{1/2}} \exp \left(-\frac{1}{2}\left( x^\top\Sigma_3^{-1}x - x^\top\Sigma_3^{-1}z - z^\top\Sigma_3^{-1}x + z^\top\Sigma_3^{-1}z \right)\right) \\
= &\frac{1}{(2\pi)^{N/2}|\Sigma_3|^{1/2}} \exp \left(-\frac{1}{2}\left( (x-z)^\top\Sigma_3^{-1}(x-z) \right)\right) \\
= &p_\mathcal{N}(x | z, \Sigma_3) \\
= &p_\mathcal{N}(x | z, \Sigma_1 + \Sigma_2) \\
\end{aligned}
\]
For independent \(x_1\sim \mathcal{N}(\mu_1, \Sigma_1)\) and \(x_2\sim \mathcal{N}(\mu_2, \Sigma_2)\) with \(x = x_1+x_2\), the density is again given by a convolution:
\[
\begin{aligned}
p(x) &= \int p_\mathcal{N}(x_1 | \mu_1, \Sigma_1) p_\mathcal{N}(x-x_1 | \mu_2, \Sigma_2) dx_1 \\
&= \int p_\mathcal{N}(x - (x-x_1+\mu_1) | 0, \Sigma_1) p_\mathcal{N}(x-x_1+\mu_1 | \mu_1+\mu_2, \Sigma_2) dx_1 \\
&= \int p_\mathcal{N}(x - y | 0, \Sigma_1) p_\mathcal{N}(y | \mu_1+\mu_2, \Sigma_2) dy \\
&= p_\mathcal{N}(x | \mu_1 + \mu_2, \Sigma_1 + \Sigma_2)
\end{aligned}
\]
The result matches the one-dimensional case:
\[
x_1 + x_2 \sim \mathcal{N}(\mu_1 + \mu_2, \Sigma_1 + \Sigma_2)
\]
Products of normal densities
In general, the product of two normal densities is a constant multiple of a normal density.
\[
\begin{aligned}
p_\mathcal{N}(x|\mu_1, \Sigma_1)p_\mathcal{N}(x|\mu_2, \Sigma_2)
&\propto \exp\left( -\frac{1}{2}\left( (x-\mu_1)^T\Sigma_1^{-1}(x-\mu_1) + (x-\mu_2)^T\Sigma_2^{-1}(x-\mu_2)\right)\right) \\
&\propto \exp\left(-\frac{1}{2}\left(x^T(\Sigma_1^{-1}+\Sigma_2^{-1})x - x^T(\Sigma_1^{-1}\mu_1 + \Sigma_2^{-1}\mu_2) - (\Sigma_1^{-1}\mu_1+\Sigma_2^{-1}\mu_2)^Tx \right) \right) \\
&\propto \exp\left( -\frac{1}{2}\left( (x-(\Sigma_1^{-1}+\Sigma_2^{-1})^{-1}(\Sigma_1^{-1}\mu_1+\Sigma_2^{-1}\mu_2))^T(\Sigma_1^{-1 }+\Sigma_2^{-1})(x-(\Sigma_1^{-1}+\Sigma_2^{-1})^{-1}(\Sigma_1^{-1}\mu_1+\Sigma_2^{-1}\mu_2)) \right) \right) \\
&\propto p_\mathcal{N}\left(x\middle|\Sigma_2(\Sigma_1+\Sigma_2)^{-1}\mu_1+\Sigma_1(\Sigma_1+\Sigma_2)^{-1}\mu_2,\ \Sigma_1(\Sigma_1+\Sigma_2)^{-1}\Sigma_2\right) \\
\end{aligned}
\]
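The proportionality can be checked numerically. The sketch below reads the mean and covariance off the completed square, with covariance \((\Sigma_1^{-1}+\Sigma_2^{-1})^{-1}\), and checks that the ratio of the product to that normal density does not depend on \(x\). The helper `normal_pdf` and all parameter values are assumptions made for illustration.

```python
import numpy as np

def normal_pdf(x, mu, cov):
    """Density of N(mu, cov) evaluated at a single point x."""
    d = x - mu
    n = len(mu)
    norm = np.sqrt((2 * np.pi) ** n * np.linalg.det(cov))
    return np.exp(-0.5 * d @ np.linalg.solve(cov, d)) / norm

mu1, mu2 = np.array([1.0, 0.0]), np.array([-1.0, 2.0])
cov1 = np.array([[2.0, 0.6], [0.6, 1.0]])
cov2 = np.array([[1.0, -0.3], [-0.3, 0.5]])

# Parameters read off the completed square (precision form).
prec = np.linalg.inv(cov1) + np.linalg.inv(cov2)
cov = np.linalg.inv(prec)
mu = cov @ (np.linalg.solve(cov1, mu1) + np.linalg.solve(cov2, mu2))

# The ratio product / N(x | mu, cov) should be the same at every point.
points = [np.array([0.0, 0.0]), np.array([1.5, -0.5]), np.array([-2.0, 3.0])]
ratios = [
    normal_pdf(p, mu1, cov1) * normal_pdf(p, mu2, cov2) / normal_pdf(p, mu, cov)
    for p in points
]
print(ratios)
```

Because covariance matrices do not commute in general, the order of the factors matters when the covariance is rewritten in terms of \(\Sigma_1+\Sigma_2\).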
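4. Jensen's Inequality
We first define convex functions.
Let \(\mathbf{I}\) be an interval in \(\mathbb{R}\) and let \(f: \mathbf{I}\rightarrow \mathbb{R}\) be a real-valued function on it. The following five properties are equivalent, and each one characterizes a convex function:
- For all \(x, y \in \mathbf{I}\) and all \(t \in [0, 1]\), we have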
\[
f(tx + (1-t)y)\le tf(x) + (1-t)f(y)
\]

- For all \(x, y, z \in \mathbf{I}\) with \(x<y<z\),
\[
\frac{f(y)-f(x)}{y-x}\le\frac{f(z)-f(x)}{z-x}\le\frac{f(z)-f(y)}{z-y}
\]

- For every \(a \in \mathbf{I}\), the function
\(\mathbf{I}\setminus\{a\}\rightarrow \mathbb{R}, x \mapsto \frac{f(x)-f(a)}{x-a}\)
is nondecreasing in \(x\).
- For all \(x, y, z \in \mathbf{I}\) with \(x<y<z\),
\[
\det
\begin{bmatrix}1&1&1 \\ x&y&z \\ f(x)&f(y)&f(z) \end{bmatrix} \ge 0
\]
- The set \(\mathcal{F}_{\ge f} = \{(x, y)|x \in \mathbf{I}, y \ge f(x)\} \subset \mathbb{R}^2\) is convex.
With convexity defined, we can state Jensen's inequality:
Suppose \(f: \mathbf{I}\rightarrow \mathbb{R}\) is convex. Then for all \(x_1, \cdots, x_n \in \mathbf{I}\) and all \(t_1, \cdots, t_n \in [0, 1]\) with \(\sum_{i=1}^n t_i = 1\), we have
\[
f(t_1x_1+\cdots+t_nx_n) \le t_1f(x_1)+\cdots+t_nf(x_n)
\]
If we regard \(t_i\) as the probability of the value \(x_i\), then \(t_1x_1+\cdots+t_nx_n=E[x]\) and \(t_1f(x_1)+\cdots+t_nf(x_n)=E[f(x)]\), so the inequality becomes
\[
f(E[x]) \le E[f(x)]
\]
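From the characterizations above, it is easy to check that \(-\log\) is convex on \((0, \infty)\), that is, \(\log\) is concave, so for a positive random variable \(x\) we get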
\[
\log(E[x]) \ge E[\log(x)]
\]
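A small numerical instance of the inequality for \(\log\), as a sketch: the values `x` and probabilities `t` below are arbitrary assumptions chosen for illustration.

```python
import numpy as np

# A discrete random variable: positive values x_i with probabilities t_i.
x = np.array([0.5, 1.0, 4.0, 10.0])
t = np.array([0.1, 0.4, 0.3, 0.2])

log_of_mean = np.log(np.sum(t * x))   # log(E[x])
mean_of_log = np.sum(t * np.log(x))   # E[log(x)]

print(log_of_mean, mean_of_log)
```

Since \(\log\) is concave, the first printed value should never be smaller than the second, whatever positive values and probabilities are chosen.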
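5. KL Divergence
Definition
An event \(x\) that everyone already knows will happen (\(p(x)=1\)) carries zero information: its occurrence tells us nothing new. Conversely, when a low-probability event occurs it carries a lot of information, because it tells us there is something new to take into account. In other words, the information content of an event decreases as its probability increases, and we write it as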
\[
I_p(x) = -\log p(x)
\]
The difference between the information assigned to \(x\) by \(p(x)\) and by \(q(x)\) is
\[
\Delta I = I_p - I_q = \log \frac{q(x)}{p(x)}
\]
The KL divergence is the expectation of this difference under \(q\):
\[
\begin{aligned}
KL(q(x)||p(x)) &= E_{x\sim q}(\Delta I) \\
&= \int (\Delta I)q(x) dx \\
&= \int q(x)\log \frac{q(x)}{p(x)} dx
\end{aligned}
\]
Because \(\log t \le t-1\) for all \(t > 0\), we have
\[
\begin{aligned}
KL(q(x)|| p(x)) &= \int q(x)\log \frac{q(x)}{p(x)} dx \\
&= -\int q(x)\log (\frac{q(x)}{p(x)})^{-1} dx \\
&= -\int q(x)\log \frac{p(x)}{q(x)} dx \\
&\ge -\int q(x) (\frac{p(x)}{q(x)} - 1)dx \\
&= -\int (p(x) - q(x))dx \\
&= -(\int p(x) dx - \int q(x)dx) \\
&= 0
\end{aligned}
\]
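That is, the KL divergence is always greater than or equal to 0. The bound \(\log t \le t - 1\) holds because \(\log\) is concave, so its graph lies below its tangent line at \(t = 1\), which is \(t - 1\).
When using the KL divergence, \(q(x)\) usually denotes the true data distribution and \(p(x)\) an approximation to it obtained by some method.
With discrete samples there are often events that were never observed, so the supports of \(p(x)\) and \(q(x)\) do not fully match. For example,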
\[
\begin{aligned}
P: (a: 3/5, b: 1/5, c: 1/5) \\
Q: (a: 5/9, b: 3/9, d: 1/9)
\end{aligned}
\]
To compute \(KL(P\|Q)\), we introduce a small constant \(\epsilon = 10^{-3}\) and modify \(P\) and \(Q\) as follows:
\[
\begin{aligned}
P: (a: 3/5-\epsilon/3, b: 1/5-\epsilon/3, c: 1/5-\epsilon/3, d: \epsilon) \\
Q: (a: 5/9-\epsilon/3, b: 3/9-\epsilon/3, c: \epsilon, d: 1/9-\epsilon/3)
\end{aligned}
\]
The KL divergence can now be computed.
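The smoothed example can be evaluated directly. This is a minimal sketch using the natural logarithm; the function name `kl_divergence` and the event ordering (a, b, c, d) are assumptions made for illustration.

```python
import numpy as np

eps = 1e-3

# Smoothed distributions over the events (a, b, c, d): each unseen event gets
# mass eps, and eps / 3 is removed from each of the three observed events.
p = np.array([3/5 - eps/3, 1/5 - eps/3, 1/5 - eps/3, eps])
q = np.array([5/9 - eps/3, 3/9 - eps/3, eps, 1/9 - eps/3])

def kl_divergence(p, q):
    """Discrete KL(p || q) = sum_x p(x) * log(p(x) / q(x)), natural log."""
    return np.sum(p * np.log(p / q))

print(kl_divergence(p, q))
```

The result depends noticeably on the choice of `eps`, because the event c has substantial mass under P but only `eps` under Q.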
KL divergence between normal distributions
Suppose we have two \(N\)-dimensional normal distributions
\[
\begin{aligned}
p(x) &= p_\mathcal{N}(x|\mu_p, \Sigma_p) = \frac{1}{(2\pi)^{N/2}|\Sigma_p|^{1/2}}\exp\left( -\frac{1}{2}(x-\mu_p)^T\Sigma_p^{-1}(x-\mu_p) \right) \\
q(x) &= p_\mathcal{N}(x|\mu_q, \Sigma_q) = \frac{1}{(2\pi)^{N/2}|\Sigma_q|^{1/2}}\exp\left( -\frac{1}{2}(x-\mu_q)^T\Sigma_q^{-1}(x-\mu_q) \right)
\end{aligned}
\]
The KL divergence between them is
\[
\begin{aligned}
D_{KL}\left(p\|q\right) &= \int p(x) \log \frac{p(x)}{q(x)} dx \\
&= \int p(x) \left(\log p(x) - \log q(x)\right) dx \\
&= \int p(x) \frac{1}{2} \left(-N\log\left(2\pi\right) - \log |\Sigma_p| - (x-\mu_p)^\top\Sigma_p^{-1}(x-\mu_p) + N\log\left(2\pi\right) + \log |\Sigma_q| + (x-\mu_q)^\top\Sigma_q^{-1}(x-\mu_q) \right) dx \\
&= \int p(x) \frac{1}{2} \left(\log \frac{|\Sigma_q|}{|\Sigma_p|} - (x-\mu_p)^\top\Sigma_p^{-1}(x-\mu_p) + (x-\mu_q)^\top\Sigma_q^{-1}(x-\mu_q) \right) dx \\
&= \frac{1}{2}\left( \log \frac{|\Sigma_q|}{|\Sigma_p|} - \int p(x)(x-\mu_p)^\top\Sigma_p^{-1}(x-\mu_p)dx + \int p(x)(x-\mu_q)^\top\Sigma_q^{-1}(x-\mu_q)dx \right) \\
\end{aligned}
\]
The trace of a matrix product is invariant under cyclic permutation of the factors; in particular,
\[
tr(AB) = tr(BA)
\]
\(x^TAx\) is a real number, which can also be viewed as a \(1\times 1\) matrix, so
\[
\begin{aligned}
x^TAx &= tr(x^TAx) \\
&= tr(Axx^T)
\end{aligned}
\]
Therefore, we obtain
\[
\begin{aligned}
&\int p(x)(x-\mu_p)^\top\Sigma_p^{-1}(x-\mu_p)dx \\
=& \int p(x)tr\left(\Sigma_p^{-1}(x-\mu_p)(x-\mu_p)^\top\right)dx \\
=& tr\left(\Sigma_p^{-1}\int p(x)(x-\mu_p)(x-\mu_p)^\top dx\right) \\
=& tr\left(\Sigma_p^{-1}\Sigma_p\right) \\
=& tr\left(I\right) \\
=& N \\
\end{aligned}
\]
\[
\begin{aligned}
&\int p(x)(x-\mu_q)^\top\Sigma_q^{-1}(x-\mu_q)dx \\
=& \int p(x)tr\left(\Sigma_q^{-1}(x-\mu_q)(x-\mu_q)^\top\right)dx \\
=& tr\left(\Sigma_q^{-1}\int p(x)(x-\mu_q)(x-\mu_q)^\top dx\right) \\
=& tr\left(\Sigma_q^{-1}\int p(x)(x-\mu_p+\mu_p-\mu_q)(x-\mu_p+\mu_p-\mu_q)^\top dx\right) \\
=& tr\left(\Sigma_q^{-1}\int p(x)\left((x-\mu_p)(x-\mu_p)^\top+(x-\mu_p)(\mu_p-\mu_q)^\top+(\mu_p-\mu_q)(x-\mu_p)^\top+(\mu_p-\mu_q)(\mu_p-\mu_q)^\top\right) dx\right) \\
=& tr\left(\Sigma_q^{-1}\left(\Sigma_p + 0 + 0 + (\mu_p-\mu_q)(\mu_p-\mu_q)^\top \right) \right) \\
=& tr\left(\Sigma_q^{-1}\Sigma_p\right) + tr\left(\Sigma_q^{-1}(\mu_p-\mu_q)(\mu_p-\mu_q)^\top\right) \\
=& tr\left(\Sigma_q^{-1}\Sigma_p\right) + (\mu_p-\mu_q)^\top\Sigma_q^{-1}(\mu_p-\mu_q) \\
\end{aligned}
\]
Therefore, the KL divergence between two normal distributions is:
\[
\begin{aligned}
D_{KL}\left(p\|q\right) &= \frac{1}{2}\left( \log \frac{|\Sigma_q|}{|\Sigma_p|} - \int p(x)(x-\mu_p)^\top\Sigma_p^{-1}(x-\mu_p)dx + \int p(x)(x-\mu_q)^\top\Sigma_q^{-1}(x-\mu_q)dx \right) \\
&= \frac{1}{2}\left( \log \frac{|\Sigma_q|}{|\Sigma_p|} - N + tr\left(\Sigma_q^{-1}\Sigma_p\right) + (\mu_p-\mu_q)^\top\Sigma_q^{-1}(\mu_p-\mu_q) \right) \\
\end{aligned}
\]
To summarize, for two normal distributions
\[
\begin{aligned}
p(x) &= p_\mathcal{N}(x|\mu_p, \Sigma_p) = \frac{1}{(2\pi)^{N/2}|\Sigma_p|^{1/2}}\exp\left( -\frac{1}{2}(x-\mu_p)^T\Sigma_p^{-1}(x-\mu_p) \right) \\
q(x) &= p_\mathcal{N}(x|\mu_q, \Sigma_q) = \frac{1}{(2\pi)^{N/2}|\Sigma_q|^{1/2}}\exp\left( -\frac{1}{2}(x-\mu_q)^T\Sigma_q^{-1}(x-\mu_q) \right)
\end{aligned}
\]
the KL divergence is
\[
D_{KL}\left(p\|q\right)
= \frac{1}{2}\left( \log \frac{|\Sigma_q|}{|\Sigma_p|} - N + tr\left(\Sigma_q^{-1}\Sigma_p\right) + (\mu_p-\mu_q)^\top\Sigma_q^{-1}(\mu_p-\mu_q) \right)
\]
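The closed form translates directly into code. The sketch below is one possible implementation; the function name `kl_gaussians` and the test parameters are assumptions made for illustration.

```python
import numpy as np

def kl_gaussians(mu_p, cov_p, mu_q, cov_q):
    """Closed-form KL(p || q) for p = N(mu_p, cov_p) and q = N(mu_q, cov_q)."""
    n = len(mu_p)
    diff = mu_p - mu_q
    # slogdet returns (sign, log|det|); it is more stable than log(det(...)).
    _, logdet_p = np.linalg.slogdet(cov_p)
    _, logdet_q = np.linalg.slogdet(cov_q)
    # tr(cov_q^{-1} cov_p), computed with a linear solve instead of an inverse.
    trace_term = np.trace(np.linalg.solve(cov_q, cov_p))
    # (mu_p - mu_q)^T cov_q^{-1} (mu_p - mu_q)
    maha_term = diff @ np.linalg.solve(cov_q, diff)
    return 0.5 * (logdet_q - logdet_p - n + trace_term + maha_term)

mu_p, cov_p = np.array([0.0, 0.0]), np.eye(2)
mu_q, cov_q = np.array([1.0, 0.0]), 2.0 * np.eye(2)

kl_same = kl_gaussians(mu_p, cov_p, mu_p, cov_p)
kl_diff = kl_gaussians(mu_p, cov_p, mu_q, cov_q)
print(kl_same, kl_diff)
```

For identical distributions the divergence is zero, and for the second pair the formula reduces by hand to \(\log 2 - \frac{1}{4}\), which gives a simple check on the implementation.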