Paper: Proximal Policy Optimization Algorithms
Basics
Policy Gradient
\[\mathbb E_{(s_t, a_t) \sim \pi_{\theta}}\big[A^{\theta}(s_t, a_t)\nabla \log p_{\theta}(a_t|s_t)\big]\]
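As a concrete illustration, here is a minimal sketch of this gradient estimator for a linear softmax policy. The setup (4-dimensional states, 2 actions, a placeholder batch of samples and advantages) is an assumption for illustration, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear softmax policy over 2 actions with 4-dim states (hypothetical setup).
theta = rng.normal(size=(4, 2))

def grad_log_pi(theta, s, a):
    """Gradient of log pi_theta(a|s) for a linear softmax policy."""
    logits = s @ theta
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # d log softmax_a / d theta = outer(s, one_hot(a) - probs)
    one_hot = np.zeros_like(probs)
    one_hot[a] = 1.0
    return np.outer(s, one_hot - probs)

# Placeholder batch of (s_t, a_t, A^theta(s_t, a_t)) sampled with pi_theta.
states = rng.normal(size=(32, 4))
actions = rng.integers(0, 2, size=32)
advantages = rng.normal(size=32)

# Monte Carlo estimate of E[ A^theta(s_t, a_t) * grad log p_theta(a_t|s_t) ].
g_hat = np.mean(
    [A * grad_log_pi(theta, s, a) for s, a, A in zip(states, actions, advantages)],
    axis=0,
)
print(g_hat.shape)  # same shape as theta: (4, 2)
```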
Importance Sampling
\[\mathbb E_{x \sim p}[f(x)] = \int f(x)p(x)\,dx = \int f(x)\frac {p(x)} {q(x)} q(x)\, dx = \mathbb E_{x \sim q}\Big[f(x)\frac {p(x)}{q(x)}\Big]\]
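A quick numerical check of this identity, with an arbitrary choice of $p$, $q$, and $f$ purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary example: estimate E_{x~p}[f(x)] for p = N(0, 1), f(x) = x^2,
# using samples drawn from a different distribution q = N(1, 2).
f = lambda x: x ** 2
p_mean, p_std = 0.0, 1.0
q_mean, q_std = 1.0, 2.0

def p_pdf(x):
    return np.exp(-0.5 * ((x - p_mean) / p_std) ** 2) / (p_std * np.sqrt(2 * np.pi))

def q_pdf(x):
    return np.exp(-0.5 * ((x - q_mean) / q_std) ** 2) / (q_std * np.sqrt(2 * np.pi))

# Direct Monte Carlo estimate with samples from p.
x_p = rng.normal(p_mean, p_std, 100_000)
direct = f(x_p).mean()

# Importance-sampled estimate with samples from q, reweighted by p(x)/q(x).
x_q = rng.normal(q_mean, q_std, 100_000)
weighted = (f(x_q) * p_pdf(x_q) / q_pdf(x_q)).mean()

print(direct, weighted)  # both approach E_p[x^2] = 1
```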
On-policy -> Off-policy
\[\mathbb E_{(s_t, a_t) \sim \pi_{\theta}}\big[A^{\theta}(s_t, a_t)\nabla \log p_{\theta}(a_t|s_t)\big]\]
\[= \mathbb E_{(s_t, a_t) \sim \pi_{\theta^{\prime}}}\Big[\frac {P_{\theta}(s_t, a_t)}{P_{\theta^{\prime}}(s_t, a_t)}A^{\theta}(s_t, a_t)\nabla \log p_{\theta}(a_t|s_t)\Big]\]
\[= \mathbb E_{(s_t, a_t) \sim \pi_{\theta^{\prime}}}\Big[\frac {p_{\theta}(a_t|s_t)p_{\theta}(s_t)}{p_{\theta^{\prime}}(a_t|s_t)p_{\theta^{\prime}}(s_t)}A^{\theta^{\prime}}(s_t, a_t)\nabla \log p_{\theta}(a_t|s_t)\Big]\]
\[= \mathbb E_{(s_t, a_t) \sim \pi_{\theta^{\prime}}}\Big[\frac {p_{\theta}(a_t|s_t)}{p_{\theta^{\prime}}(a_t|s_t)}A^{\theta^{\prime}}(s_t, a_t)\nabla \log p_{\theta}(a_t|s_t)\Big]\]
Two approximations are made here: $A^{\theta}$ is replaced by $A^{\theta^{\prime}}$, since the advantage can only be estimated from data collected by $\pi_{\theta^{\prime}}$, and the state-distribution ratio $p_{\theta}(s_t)/p_{\theta^{\prime}}(s_t)$ is assumed to be close to 1 and dropped.
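A minimal sketch of this off-policy (ratio-weighted) gradient, assuming a hypothetical `policy_net` for $\pi_{\theta}$, a frozen copy acting as the behaviour policy $\pi_{\theta^{\prime}}$, and placeholder rollout data:

```python
import copy
import torch
from torch.distributions import Categorical

policy_net = torch.nn.Linear(4, 2)          # current policy pi_theta (hypothetical)
old_policy_net = copy.deepcopy(policy_net)  # behaviour policy pi_theta' that collected the data

# Placeholder batch sampled by acting with pi_theta'.
states = torch.randn(32, 4)
actions = torch.randint(0, 2, (32,))
advantages = torch.randn(32)                # A^{theta'}(s_t, a_t), assumed precomputed

# log-probabilities under both policies; no gradient flows through pi_theta'.
log_probs = Categorical(logits=policy_net(states)).log_prob(actions)
with torch.no_grad():
    old_log_probs = Categorical(logits=old_policy_net(states)).log_prob(actions)

# Ratio p_theta(a_t|s_t) / p_theta'(a_t|s_t) times the advantage: differentiating
# this surrogate reproduces the ratio-weighted gradient above.
ratio = torch.exp(log_probs - old_log_probs)
surrogate = (ratio * advantages).mean()
surrogate.backward()
```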
Background: Policy Optimization
Policy Gradient Methods
Suppose the policy we want to train is
\[\pi_{\theta}(a_t|s_t)\]
Then its gradient can be written as:
\[\hat g = \hat{\mathbb E}\big[ \nabla_{\theta} \log \pi_{\theta}(a_t|s_t) \hat A_t \big]\]
- $\pi_{\theta}$ is a stochastic policy
- $\hat {A_t}$ is an estimator of the advantage function at timestep $t$.
Then the objective is:
\[L^{PG}(\theta) = \hat{\mathbb E}\big[\log \pi_{\theta}(a_t|s_t)\hat A_t\big]\]
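In practice this objective is implemented as a loss to minimize (its negative). A minimal sketch of one update step, again assuming a hypothetical `policy_net` and precomputed advantage estimates:

```python
import torch
from torch.distributions import Categorical

policy_net = torch.nn.Linear(4, 2)                            # hypothetical pi_theta
optimizer = torch.optim.Adam(policy_net.parameters(), lr=3e-4)

# Placeholder rollout batch: states, chosen actions, advantage estimates A_hat_t.
states = torch.randn(32, 4)
actions = torch.randint(0, 2, (32,))
advantages = torch.randn(32)

# L^PG(theta) = E[ log pi_theta(a_t|s_t) * A_hat_t ]; maximize it by
# minimizing its negative with a standard optimizer.
log_probs = Categorical(logits=policy_net(states)).log_prob(actions)
loss = -(log_probs * advantages).mean()

optimizer.zero_grad()
loss.backward()
optimizer.step()
```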
TRPO
\[\operatorname*{maximize}_{\theta}\ \hat {\mathbb E} \bigg[ \frac {\pi_{\theta}(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}\hat A_t \bigg]\]
subject to the constraint:
\[\hat {\mathbb E}\big[\mathrm{KL}[\pi_{\theta_{old}}(\cdot|s_t), \pi_{\theta}(\cdot|s_t)]\big] \leq \delta\]
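TRPO proper enforces this constraint with a natural-gradient / conjugate-gradient step. The sketch below is much simpler: it only evaluates the surrogate and the mean KL so the trust-region condition can be checked, which conveys the idea without the second-order machinery. The network, batch, and $\delta$ value are assumptions for illustration.

```python
import copy
import torch
from torch.distributions import Categorical, kl_divergence

policy_net = torch.nn.Linear(4, 2)          # hypothetical pi_theta
old_policy_net = copy.deepcopy(policy_net)  # frozen pi_theta_old

# Placeholder rollout batch collected with pi_theta_old.
states = torch.randn(32, 4)
actions = torch.randint(0, 2, (32,))
advantages = torch.randn(32)
delta = 0.01                                # KL trust-region size (assumed value)

dist = Categorical(logits=policy_net(states))
with torch.no_grad():
    old_dist = Categorical(logits=old_policy_net(states))

# Surrogate objective: E[ ratio * A_hat ].
ratio = torch.exp(dist.log_prob(actions) - old_dist.log_prob(actions))
surrogate = (ratio * advantages).mean()

# Constraint term: E[ KL(pi_theta_old || pi_theta) ] <= delta.
mean_kl = kl_divergence(old_dist, dist).mean()

# Simplified update: take the gradient step only while the constraint holds
# (TRPO itself solves the constrained problem with natural gradients).
if mean_kl.item() <= delta:
    (-surrogate).backward()
```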