Trust Region Policy Optimization
We describe an iterative procedure for optimizing policies with guaranteed monotonic improvement. By making several approximations to the theoretically justified procedure, we develop a practical algorithm called Trust Region Policy Optimization (TRPO). This algorithm is similar to natural policy gradient methods and is effective for optimizing large nonlinear policies such as neural networks. Our experiments demonstrate its robust performance on a wide variety of tasks: learning simulated robotic swimming, hopping, and walking gaits, and playing Atari games using images of the screen as input. Despite its approximations that deviate from the theory, TRPO tends to give monotonic improvement, with little tuning of hyperparameters.
Method
Preliminaries
Define the expected discounted return:
$$
\eta(\pi) = \mathbb{E}_{s_0, a_0, \ldots}\left[\sum_{t=0}^{\infty} \gamma^{t} r(s_t)\right], \quad \text{where } s_0 \sim \rho_0(s_0),\; a_t \sim \pi(a_t \mid s_t),\; s_{t+1} \sim P(s_{t+1} \mid s_t, a_t)
$$
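As a concrete reading of this definition, here is a minimal Monte Carlo sketch (not from the paper): `env` is assumed to be a simplified environment whose `step` returns `(next_state, reward, done)`, and `policy.sample` draws an action; both are hypothetical interfaces.

```python
import numpy as np

def estimate_eta(env, policy, gamma=0.99, num_rollouts=100, horizon=1000):
    """Monte Carlo estimate of eta(pi) = E[sum_t gamma^t r(s_t)]."""
    returns = []
    for _ in range(num_rollouts):
        s = env.reset()                      # s_0 ~ rho_0(s_0)
        total, discount = 0.0, 1.0
        for _ in range(horizon):
            a = policy.sample(s)             # a_t ~ pi(a_t | s_t)
            s, r, done = env.step(a)         # s_{t+1} ~ P(s_{t+1} | s_t, a_t)
            total += discount * r
            discount *= gamma
            if done:
                break
        returns.append(total)
    return float(np.mean(returns))
```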
Define the action-value function, the state-value function, and the advantage function:
$$
\begin{aligned}
Q_{\pi}(s_t, a_t) &= \mathbb{E}_{s_{t+1}, a_{t+1}, \ldots}\left[\sum_{l=0}^{\infty} \gamma^{l} r(s_{t+l})\right] \\
V_{\pi}(s_t) &= \mathbb{E}_{a_t, s_{t+1}, \ldots}\left[\sum_{l=0}^{\infty} \gamma^{l} r(s_{t+l})\right] \\
A_{\pi}(s, a) &= Q_{\pi}(s, a) - V_{\pi}(s),
\end{aligned}
$$

where $a_t \sim \pi(a_t \mid s_t)$ and $s_{t+1} \sim P(s_{t+1} \mid s_t, a_t)$ for $t \geq 0$.
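A small sketch of how these quantities relate on sampled data, assuming a hypothetical state-value estimator `value_fn`: the empirical discounted return-to-go stands in for $Q$, and subtracting $V$ gives the advantage.

```python
import numpy as np

def advantages_from_rollout(rewards, states, value_fn, gamma=0.99):
    """A(s_t, a_t) ~= [sum_l gamma^l r_{t+l}] - V(s_t), along one rollout."""
    T = len(rewards)
    q = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):             # return-to-go as a stand-in for Q(s_t, a_t)
        running = rewards[t] + gamma * running
        q[t] = running
    v = np.array([value_fn(s) for s in states])   # V(s_t)
    return q - v
```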
The expected return of a new policy $\tilde{\pi}$ can then be written in terms of its advantage over the current policy $\pi$:

$$
\eta(\tilde{\pi}) = \eta(\pi) + \mathbb{E}_{s_0, a_0, \ldots \sim \tilde{\pi}}\left[\sum_{t=0}^{\infty} \gamma^{t} A_{\pi}(s_t, a_t)\right]
$$
That is, the trajectory is sampled under $\tilde{\pi}$, while the advantages are evaluated under the old policy $\pi$.
Let $\rho_{\pi}(s) = P(s_0 = s) + \gamma P(s_1 = s) + \gamma^{2} P(s_2 = s) + \ldots$ denote the (unnormalized) discounted state-visitation frequencies. The identity above can then be rewritten as
$$
\begin{aligned}
\eta(\tilde{\pi}) &= \eta(\pi) + \sum_{t=0}^{\infty} \sum_{s} P(s_t = s \mid \tilde{\pi}) \sum_{a} \tilde{\pi}(a \mid s)\, \gamma^{t} A_{\pi}(s, a) \\
&= \eta(\pi) + \sum_{s} \sum_{t=0}^{\infty} \gamma^{t} P(s_t = s \mid \tilde{\pi}) \sum_{a} \tilde{\pi}(a \mid s)\, A_{\pi}(s, a) \\
&= \eta(\pi) + \sum_{s} \rho_{\tilde{\pi}}(s) \sum_{a} \tilde{\pi}(a \mid s)\, A_{\pi}(s, a)
\end{aligned}
$$

This identity shows that any update satisfying $\sum_{a} \tilde{\pi}(a \mid s) A_{\pi}(s, a) \geq 0$ at every state increases the expected return. Consider the policy-iteration update $\tilde{\pi}(s) = \arg\max_{a} A_{\pi}(s, a)$: as long as at least one advantage is positive somewhere, the policy improves; otherwise it has converged. In the approximate setting, however, estimation and approximation errors make it unavoidable that $\sum_{a} \tilde{\pi}(a \mid s) A_{\pi}(s, a) < 0$ at some states. Moreover, the dependence of $\rho_{\tilde{\pi}}(s)$ on $\tilde{\pi}$ makes the expression difficult to optimize directly, so we introduce the following local approximation:
$$
L_{\pi}(\tilde{\pi}) = \eta(\pi) + \sum_{s} \rho_{\pi}(s) \sum_{a} \tilde{\pi}(a \mid s)\, A_{\pi}(s, a)
$$

Note that this uses $\rho_{\pi}(s)$ rather than $\rho_{\tilde{\pi}}(s)$, i.e. the state-visitation frequencies of the old policy.
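A tabular sketch of $L_{\pi}$ (illustrative array shapes, not code from the paper): the discounted visitation frequencies are estimated from trajectories collected with the old policy, and the candidate policy $\tilde{\pi}$ only reweights the actions.

```python
import numpy as np

def discounted_visitation(trajectories, num_states, gamma=0.99):
    """rho_pi(s) = P(s_0=s) + gamma P(s_1=s) + ..., averaged over sampled trajectories."""
    rho = np.zeros(num_states)
    for states in trajectories:              # each trajectory: a list of state indices
        for t, s in enumerate(states):
            rho[s] += gamma ** t
    return rho / len(trajectories)

def surrogate_L(eta_old, rho_old, pi_tilde, adv_old):
    """L_pi(pi_tilde) = eta(pi) + sum_s rho_pi(s) sum_a pi_tilde(a|s) A_pi(s,a)."""
    return eta_old + np.sum(rho_old[:, None] * pi_tilde * adv_old)
```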
In fact, $L_{\pi}$ matches $\eta$ to first order in the policy parameters:
$$
\begin{aligned}
L_{\pi_{\theta_0}}(\pi_{\theta_0}) &= \eta(\pi_{\theta_0}) \\
\nabla_{\theta} L_{\pi_{\theta_0}}(\pi_{\theta})\big|_{\theta=\theta_0} &= \nabla_{\theta}\, \eta(\pi_{\theta})\big|_{\theta=\theta_0}
\end{aligned}
$$

This gives a direction in which a sufficiently small step improves the policy, but it does not tell us how large a step can safely be taken. To address this, consider the following mixture update:
Let $\pi' = \arg\max_{\pi'} L_{\pi_{\mathrm{old}}}(\pi')$, and define the new policy as the mixture $\pi_{\mathrm{new}}(a \mid s) = (1-\alpha)\, \pi_{\mathrm{old}}(a \mid s) + \alpha\, \pi'(a \mid s)$.
Following Kakade & Langford (2002), this mixture update satisfies the lower bound
$$
\eta(\pi_{\mathrm{new}}) \geq L_{\pi_{\mathrm{old}}}(\pi_{\mathrm{new}}) - \frac{2 \epsilon \gamma}{(1-\gamma)^{2}}\, \alpha^{2}, \quad \text{where } \epsilon = \max_{s}\left|\mathbb{E}_{a \sim \pi'(a \mid s)}\left[A_{\pi}(s, a)\right]\right|
$$
As long as each update increases the right-hand side of this bound, the true objective $\eta$ is guaranteed not to decrease.
Monotonic Improvement Guarantee for General Stochastic Policies
The bound above only applies to mixture policies. To extend it to general stochastic policies, $\alpha$ is reinterpreted as a distance between the two policies, measured by the total variation divergence:
$$
D_{\mathrm{TV}}^{\max}(\pi, \tilde{\pi}) = \max_{s} D_{\mathrm{TV}}\big(\pi(\cdot \mid s) \,\|\, \tilde{\pi}(\cdot \mid s)\big)
$$

With $\alpha = D_{\mathrm{TV}}^{\max}(\pi, \tilde{\pi})$, the following theorem holds (the proof is given in the appendix of the original paper):

$$
\eta(\tilde{\pi}) \geq L_{\pi}(\tilde{\pi}) - \frac{4 \epsilon \gamma}{(1-\gamma)^{2}}\, \alpha^{2}, \quad \text{where } \epsilon = \max_{s, a}\left|A_{\pi}(s, a)\right|
$$
Noting the relationship between total variation and KL divergence, $D_{\mathrm{TV}}(p \,\|\, q)^{2} \leq D_{\mathrm{KL}}(p \,\|\, q)$, the theorem can be rewritten as

$$
\eta(\tilde{\pi}) \geq L_{\pi}(\tilde{\pi}) - C\, D_{\mathrm{KL}}^{\max}(\pi, \tilde{\pi}), \quad \text{where } C = \frac{4 \epsilon \gamma}{(1-\gamma)^{2}}
$$
This bound yields Algorithm 1, a policy iteration scheme that maximizes the penalized surrogate $L_{\pi_i}(\pi) - C\, D_{\mathrm{KL}}^{\max}(\pi_i, \pi)$ at each iteration and is guaranteed to generate a monotonically improving sequence of policies (a code sketch is given after the proof below).
Proof: let $M_{i}(\pi) = L_{\pi_{i}}(\pi) - C\, D_{\mathrm{KL}}^{\max}(\pi_{i}, \pi)$. Combining with the bound above, $\eta(\pi_{i+1}) \geq M_{i}(\pi_{i+1})$, and $\eta(\pi_{i}) = M_{i}(\pi_{i})$ because the KL term vanishes at $\pi = \pi_{i}$. Therefore $\eta(\pi_{i+1}) - \eta(\pi_{i}) \geq M_{i}(\pi_{i+1}) - M_{i}(\pi_{i})$, so maximizing $M_{i}$ at each iteration guarantees that the true objective $\eta$ is non-decreasing.
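A minimal sketch of the structure of Algorithm 1, assuming hypothetical helpers `estimate_advantages` (returns $A_{\pi_i}$) and `solve_penalized` (maximizes the penalized surrogate); the penalty coefficient comes from the bound above.

```python
def policy_iteration_with_penalty(pi, estimate_advantages, solve_penalized,
                                  num_iters, eps, gamma):
    """Each step maximizes M_i(pi') = L_{pi_i}(pi') - C * D_KL^max(pi_i, pi')."""
    C = 4 * eps * gamma / (1 - gamma) ** 2    # penalty coefficient from the theorem
    for _ in range(num_iters):
        adv = estimate_advantages(pi)         # A_{pi_i}(s, a) under the current policy
        pi = solve_penalized(pi, adv, C)      # pi_{i+1} = argmax_{pi'} M_i(pi')
    return pi
```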
Trust region policy optimization, the approximation to Algorithm 1 proposed in the paper, uses a constraint on the KL divergence rather than a penalty, which allows taking robustly large update steps.
Optimization of Parameterized Policies
We convert the problem solved by Algorithm 1 from the penalized form
$$
\underset{\theta}{\operatorname{maximize}}\;\left[L_{\theta_{\mathrm{old}}}(\theta) - C\, D_{\mathrm{KL}}^{\max}(\theta_{\mathrm{old}}, \theta)\right]
$$
into a constrained maximization problem, since the penalty coefficient $C$ recommended by the theory would lead to very small step sizes in practice:
$$
\begin{array}{l}
\underset{\theta}{\operatorname{maximize}}\; L_{\theta_{\mathrm{old}}}(\theta) \\
\text{subject to } D_{\mathrm{KL}}^{\max}(\theta_{\mathrm{old}}, \theta) \leq \delta
\end{array}
$$
Because constraining the KL divergence at every state is impractical, the maximum KL divergence is further replaced by the average KL divergence:
$$
\overline{D}_{\mathrm{KL}}^{\rho}(\theta_{1}, \theta_{2}) := \mathbb{E}_{s \sim \rho}\left[D_{\mathrm{KL}}\big(\pi_{\theta_{1}}(\cdot \mid s) \,\|\, \pi_{\theta_{2}}(\cdot \mid s)\big)\right]
$$
which yields
$$
\begin{array}{l}
\underset{\theta}{\operatorname{maximize}}\; L_{\theta_{\mathrm{old}}}(\theta) \\
\text{subject to } \overline{D}_{\mathrm{KL}}^{\rho_{\theta_{\mathrm{old}}}}(\theta_{\mathrm{old}}, \theta) \leq \delta
\end{array}
$$
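For discrete action spaces, the averaged KL constraint can be checked as in the sketch below (illustrative array names; `probs_old` and `probs_new` hold $\pi_{\theta_{\mathrm{old}}}(\cdot \mid s)$ and $\pi_{\theta}(\cdot \mid s)$ evaluated on the same states sampled from $\rho_{\theta_{\mathrm{old}}}$).

```python
import numpy as np

def mean_kl(probs_old, probs_new, eps=1e-8):
    """Average over sampled states of KL(pi_old(.|s) || pi_new(.|s))."""
    kl_per_state = np.sum(probs_old * (np.log(probs_old + eps) -
                                       np.log(probs_new + eps)), axis=1)
    return kl_per_state.mean()

def satisfies_trust_region(probs_old, probs_new, delta=0.01):
    return mean_kl(probs_old, probs_new) <= delta    # the constraint of the problem above
```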
Sample-Based Estimation of the Objective and Constraint
This section approximates the optimization problem above by Monte Carlo simulation. Expanding $L_{\theta_{\mathrm{old}}}(\theta)$, the problem becomes

$$
\underset{\theta}{\operatorname{maximize}}\; \sum_{s} \rho_{\theta_{\mathrm{old}}}(s) \sum_{a} \pi_{\theta}(a \mid s)\, A_{\theta_{\mathrm{old}}}(s, a) \quad \text{subject to } \overline{D}_{\mathrm{KL}}^{\rho_{\theta_{\mathrm{old}}}}(\theta_{\mathrm{old}}, \theta) \leq \delta
$$
First, $\sum_{s} \rho_{\theta_{\mathrm{old}}}(s)\,[\ldots]$ is replaced by the expectation $\frac{1}{1-\gamma}\, \mathbb{E}_{s \sim \rho_{\theta_{\mathrm{old}}}}[\ldots]$, and the advantages $A_{\theta_{\mathrm{old}}}$ are replaced by the Q-values $Q_{\theta_{\mathrm{old}}}$, which changes the objective only by a constant.
Finally, the sum over actions is handled with importance sampling. With $q$ denoting the sampling distribution, the contribution of a single state $s_n$ is
$$
\sum_{a} \pi_{\theta}(a \mid s_{n})\, A_{\theta_{\mathrm{old}}}(s_{n}, a) = \mathbb{E}_{a \sim q}\left[\frac{\pi_{\theta}(a \mid s_{n})}{q(a \mid s_{n})}\, A_{\theta_{\mathrm{old}}}(s_{n}, a)\right]
$$
This gives the equivalent optimization problem, written entirely in terms of expectations:

$$
\begin{array}{l}
\underset{\theta}{\operatorname{maximize}}\; \mathbb{E}_{s \sim \rho_{\theta_{\mathrm{old}}},\, a \sim q}\left[\dfrac{\pi_{\theta}(a \mid s)}{q(a \mid s)}\, Q_{\theta_{\mathrm{old}}}(s, a)\right] \\
\text{subject to } \mathbb{E}_{s \sim \rho_{\theta_{\mathrm{old}}}}\left[D_{\mathrm{KL}}\big(\pi_{\theta_{\mathrm{old}}}(\cdot \mid s) \,\|\, \pi_{\theta}(\cdot \mid s)\big)\right] \leq \delta
\end{array}
$$
All that remains is to replace the expectations by sample averages and the Q-values by empirical estimates, as sketched below. The paper describes two schemes for performing this estimation: the single-path scheme, which samples individual trajectories, and the vine scheme, which builds a rollout set and performs multiple short rollouts from each state in it.
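A sketch of the resulting sample-mean estimator (array names are illustrative): each sample contributes an importance weight $\pi_{\theta}(a \mid s)/q(a \mid s)$ times its empirical advantage or Q estimate; in the single-path case $q = \pi_{\theta_{\mathrm{old}}}$.

```python
import numpy as np

def sampled_surrogate(logp_new, logp_q, advantages):
    """Sample-mean estimate of E_{s~rho_old, a~q}[ pi_theta(a|s)/q(a|s) * A_old(s,a) ]."""
    ratio = np.exp(logp_new - logp_q)        # importance weights pi_theta / q
    return np.mean(ratio * advantages)
```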
Practical Algorithm
The practical algorithm repeatedly performs three steps:

1. Collect a set of state-action pairs together with Monte Carlo estimates of their Q-values, using either the single-path or the vine procedure.
2. By averaging over the samples, construct the estimated objective and constraint of the optimization problem above.
3. Approximately solve this constrained optimization problem to update the policy parameters $\theta$, using the conjugate gradient algorithm followed by a line search.
Appendix C of the original paper gives the concrete optimization algorithm; a simplified sketch follows.
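A simplified sketch of that procedure, assuming hypothetical callables: `Avp` computes Fisher-vector products (the Hessian of the mean KL times a vector), `surrogate` evaluates the sampled objective, and `mean_kl` evaluates the constraint. The search direction is obtained with conjugate gradient, rescaled to the trust-region boundary, and then backtracked with a line search.

```python
import numpy as np

def conjugate_gradient(Avp, g, iters=10, tol=1e-10):
    """Approximately solve A x = g using only matrix-vector products A @ v."""
    x, r, p = np.zeros_like(g), g.copy(), g.copy()
    rs_old = r @ r
    for _ in range(iters):
        Ap = Avp(p)
        alpha = rs_old / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

def trpo_step(theta, g, Avp, surrogate, mean_kl, delta=0.01, backtracks=10):
    step_dir = conjugate_gradient(Avp, g)                     # approx A^{-1} g
    step_size = np.sqrt(2 * delta / (step_dir @ Avp(step_dir)))
    full_step = step_size * step_dir                          # step to the KL boundary
    f_old = surrogate(theta)
    for frac in 0.5 ** np.arange(backtracks):                 # backtracking line search
        theta_new = theta + frac * full_step
        if surrogate(theta_new) > f_old and mean_kl(theta_new) <= delta:
            return theta_new
    return theta                                              # no acceptable step found
```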
To summarize the relationship between the theory and the practical algorithm: the theory justifies optimizing a surrogate objective penalized by the KL divergence, while the practical algorithm replaces the penalty with a hard constraint (the trust region) and the hard-to-estimate maximum KL with the mean KL, and must additionally tolerate estimation error in the advantages.
Connections with Prior Work
Using a linear approximation to $L$ and a quadratic approximation to the KL divergence constraint, the natural policy gradient update can be viewed as a special case of the TRPO update:
$$
\begin{array}{l}
\underset{\theta}{\operatorname{maximize}}\;\left[\nabla_{\theta} L_{\theta_{\mathrm{old}}}(\theta)\big|_{\theta=\theta_{\mathrm{old}}} \cdot (\theta - \theta_{\mathrm{old}})\right] \\
\text{subject to } \frac{1}{2}\, (\theta_{\mathrm{old}} - \theta)^{T} A(\theta_{\mathrm{old}})\, (\theta_{\mathrm{old}} - \theta) \leq \delta, \\
\text{where } A(\theta_{\mathrm{old}})_{ij} = \dfrac{\partial}{\partial \theta_{i}} \dfrac{\partial}{\partial \theta_{j}}\, \mathbb{E}_{s \sim \rho_{\pi}}\left[D_{\mathrm{KL}}\big(\pi(\cdot \mid s, \theta_{\mathrm{old}}) \,\|\, \pi(\cdot \mid s, \theta)\big)\right]\Big|_{\theta=\theta_{\mathrm{old}}}
\end{array}
$$

The corresponding update is $\theta_{\mathrm{new}} = \theta_{\mathrm{old}} + \frac{1}{\lambda} A(\theta_{\mathrm{old}})^{-1}\, \nabla_{\theta} L(\theta)\big|_{\theta=\theta_{\mathrm{old}}}$, where the step size $\frac{1}{\lambda}$ is typically treated as an algorithm parameter, whereas TRPO enforces the KL constraint at every update.
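A sketch of this special case (the Fisher matrix is formed and inverted explicitly here only for illustration; TRPO instead uses conjugate gradient on Fisher-vector products and a line search on the exact surrogate and KL).

```python
import numpy as np

def natural_gradient_step(theta_old, g, fisher, lam=1.0):
    """theta_new = theta_old + (1/lambda) * A(theta_old)^{-1} * grad L, with fixed 1/lambda."""
    return theta_old + np.linalg.solve(fisher, g) / lam
```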
Experiments