Notes: Proximal Policy Optimization Algorithms


Paper: Proximal Policy Optimization Algorithms

Basics

Policy Gradient

\[\mathbb E_{(s_t, a_t) \sim \pi_{\theta}}\big[A^{\theta}(s_t, a_t)\nabla \log p_{\theta}(a_t|s_t)\big]\]

Importance Sampling

\[\mathbb E_{x \sim p}[f(x)] = \int f(x)p(x)dx = \int f(x)\frac {p(x)} {q(x)} q(x) dx = \mathbb E_{x \sim q}[f(x)\frac {p(x)}{q(x)}]\]
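
As a quick sanity check, here is a minimal NumPy sketch of the identity above; the target density $p = \mathcal N(0, 1)$, the proposal $q = \mathcal N(1, 1)$, and $f(x) = x^2$ are arbitrary choices made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def gauss_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

f = lambda x: x ** 2          # any integrable test function
n = 200_000

# Direct Monte Carlo estimate of E_{x~p}[f(x)] with samples from p = N(0, 1)
x_p = rng.normal(0.0, 1.0, n)
direct = f(x_p).mean()

# Importance-sampling estimate: sample from q = N(1, 1), reweight by p(x)/q(x)
x_q = rng.normal(1.0, 1.0, n)
weights = gauss_pdf(x_q, 0.0, 1.0) / gauss_pdf(x_q, 1.0, 1.0)
importance = (f(x_q) * weights).mean()

print(direct, importance)     # both approach E_p[x^2] = 1
```

The reweighted estimator has the same expectation, but its variance grows when $p$ and $q$ differ too much, which is exactly the concern that motivates keeping the new and old policies close below.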

On-policy -> Off-policy

\[\mathbb E_{(s_t, a_t) \sim \pi_{\theta}}\big[A^{\theta}(s_t, a_t)\nabla \log p_{\theta}(a_t|s_t)\big]\]
\[= \mathbb E_{(s_t, a_t) \sim \pi_{\theta^{\prime}}}\Big[\frac {P_{\theta}(s_t, a_t)}{P_{\theta^{\prime}}(s_t, a_t)}A^{\theta}(s_t, a_t)\nabla \log p_{\theta}(a_t|s_t)\Big]\]
\[\approx \mathbb E_{(s_t, a_t) \sim \pi_{\theta^{\prime}}}\Big[\frac {p_{\theta}(a_t|s_t)p_{\theta}(s_t)}{p_{\theta^{\prime}}(a_t|s_t)p_{\theta^{\prime}}(s_t)}A^{\theta^{\prime}}(s_t, a_t)\nabla \log p_{\theta}(a_t|s_t)\Big]\]
\[\approx \mathbb E_{(s_t, a_t) \sim \pi_{\theta^{\prime}}}\Big[\frac {p_{\theta}(a_t|s_t)}{p_{\theta^{\prime}}(a_t|s_t)}A^{\theta^{\prime}}(s_t, a_t)\nabla \log p_{\theta}(a_t|s_t)\Big]\]

Two approximations enter here: the advantage is estimated as $A^{\theta^{\prime}}$ because the trajectories are collected with $\pi_{\theta^{\prime}}$, and the state-distribution ratio $\frac{p_{\theta}(s_t)}{p_{\theta^{\prime}}(s_t)}$ is assumed to be close to 1 and dropped. Both are reasonable only when $\theta$ and $\theta^{\prime}$ stay close, which motivates the constraints used by TRPO and PPO below.

Background: Policy Optimization

Policy Gradient Methods

Suppose the policy we want to train is

\[\pi_{\theta}(a_t|s_t)\]

Its gradient estimator can then be written as:

\[\hat g = \hat{\mathbb E}\big[ \nabla_{\theta} \log \pi_{\theta}(a_t|s_t) \hat {A_t} \big]\]
  • $\pi_{\theta}$ is a stochastic policy
  • $\hat {A_t}$ is an estimator of the advantage function at timestep $t$.

The corresponding objective is:

\[L^{PG}(\theta) = \hat {\mathbb E}\big[\log \pi_{\theta}(a_t|s_t)\hat {A_t}\big]\]
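
A minimal NumPy sketch of this objective, assuming the rollout has already produced per-timestep log-probabilities $\log \pi_{\theta}(a_t|s_t)$ and advantage estimates $\hat {A_t}$ (the numbers below are made up for illustration):

```python
import numpy as np

# Hypothetical rollout data: log pi_theta(a_t|s_t) of the taken actions and A_hat_t
log_probs = np.array([-0.35, -1.20, -0.70, -2.10])   # log-probabilities of sampled actions
advantages = np.array([1.5, -0.4, 0.9, 0.2])          # advantage estimates per timestep

# L^PG(theta): empirical mean of log pi_theta(a_t|s_t) * A_hat_t
L_pg = np.mean(log_probs * advantages)
print(L_pg)   # ascending the gradient of this quantity recovers the estimator g_hat above
```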

TRPO

\[\operatorname*{maximize}_{\theta} \hat {\mathbb E} \bigg[ \frac {\pi_{\theta}(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}\hat {A_t} \bigg]\]

The objective above is subject to an additional constraint:

\[\hat {\mathbb E}[KL[\pi_{\theta_{old}}(\cdot|s_t), \pi_{\theta}(\cdot|s_t)]] \leq \delta\]
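
For a discrete action space the KL term can be computed directly from the two policies' action distributions. The sketch below uses made-up probabilities and a made-up $\delta$, just to show how the constraint would be checked:

```python
import numpy as np

def kl_categorical(p, q):
    """KL[p || q] for categorical distributions over actions (one row per state)."""
    return np.sum(p * (np.log(p) - np.log(q)), axis=-1)

# Hypothetical action probabilities at two visited states
pi_new = np.array([[0.70, 0.20, 0.10],
                   [0.25, 0.25, 0.50]])   # pi_theta(.|s_t)
pi_old = np.array([[0.60, 0.25, 0.15],
                   [0.30, 0.30, 0.40]])   # pi_theta_old(.|s_t)

delta = 0.01
mean_kl = np.mean(kl_categorical(pi_old, pi_new))   # E_hat[ KL[pi_old(.|s_t), pi_theta(.|s_t)] ]
print(mean_kl, mean_kl <= delta)                    # TRPO requires the mean KL to stay within delta
```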

PPO

\[\operatorname*{maximize}_{\theta} \hat {\mathbb E} \bigg[ \frac {\pi_{\theta}(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}\hat {A_t} - \beta KL[\pi_{\theta_{old}}(\cdot|s_t), \pi_{\theta}(\cdot|s_t)] \bigg]\]
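
Putting the pieces together, the following sketch evaluates this KL-penalized surrogate on a small batch; the probability ratios, advantages, action distributions, and $\beta$ are all made-up values, assuming a discrete action space:

```python
import numpy as np

def kl_categorical(p, q):
    """KL[p || q] for categorical action distributions (one row per state)."""
    return np.sum(p * (np.log(p) - np.log(q)), axis=-1)

beta = 0.5  # KL penalty coefficient (the paper's KL-penalty variant adapts it between updates)

# Hypothetical per-timestep data collected with pi_theta_old
ratio = np.array([1.10, 0.85, 1.30])        # pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t)
advantages = np.array([0.8, -0.3, 1.2])     # advantage estimates A_hat_t

# Full action distributions at the visited states, needed for the KL term
pi_new = np.array([[0.55, 0.45], [0.40, 0.60], [0.70, 0.30]])
pi_old = np.array([[0.50, 0.50], [0.45, 0.55], [0.60, 0.40]])

objective = np.mean(ratio * advantages - beta * kl_categorical(pi_old, pi_new))
print(objective)   # maximized with respect to theta in the KL-penalty form of PPO
```

Instead of a hard constraint as in TRPO, the divergence between the new and old policies is folded into the objective as a penalty, so ordinary first-order optimizers can be used.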