# DDPG + Mixing policy targets

> [On-policy vs. off-policy updates for deep reinforcement learning](https://www.cs.utexas.edu/~pstone/Papers/bib2html-links/DeepRL16-hausknecht.pdf)

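Temporal-difference-based deep reinforcement learning methods have typically been driven by off-policy, bootstrap Q-Learning updates. This paper investigates the effect of using on-policy, Monte Carlo updates instead. The empirical results show that for the DDPG algorithm in a continuous action space, mixing on-policy and off-policy update targets gives better performance and stability than using either target alone. The same technique applied to DQN in a discrete action space drastically slows down learning. These findings raise questions about the nature of on-policy versus off-policy and bootstrap versus Monte Carlo updates, and about their relationship to deep reinforcement learning methods.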

## Method

### Relationship between temporal difference and Monte Carlo

![](/files/-Lc3iru3A1-0RRZEYFDg)

![](/files/-Lc3tPcDSzHX3UU3HyHE)

![](/files/-Lc3j1kkfoxFgKUK0AMc)
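
As a text companion to the figures above, the standard relationship between the two kinds of target can be written out as follows. This is a sketch in common textbook notation, which may differ from the paper's: $Q$ is the learned value function, $\gamma$ the discount factor, $r_i$ the reward at step $i$, and $T$ the final step of the episode. For DDPG the maximization over actions is replaced by the action chosen by the actor network.

$$
\begin{aligned}
y^{(1)}_t &= r_t + \gamma \max_{a} Q(s_{t+1}, a) && \text{(1-step Q-Learning, bootstrapped)} \\
y^{(n)}_t &= \sum_{i=t}^{t+n-1} \gamma^{\,i-t} r_i + \gamma^{n} \max_{a} Q(s_{t+n}, a) && \text{(n-step return)} \\
R_t &= \sum_{i=t}^{T} \gamma^{\,i-t} r_i && \text{(Monte Carlo return, no bootstrap)}
\end{aligned}
$$

Extending the n-step return to the end of the episode removes the bootstrap term and leaves the Monte Carlo return, which is why the two targets can be seen as the ends of one spectrum.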

### Computing On-Policy MC Targets

![](/files/-Lc3tjw252X9_cxVP4kQ)
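
A minimal Python sketch of how an on-policy Monte Carlo target can be computed once an episode has finished, assuming the target for step t is the discounted sum of the rewards actually observed from t to the end of the episode. The function name `monte_carlo_returns` and its parameters are hypothetical and not taken from the paper.

```python
def monte_carlo_returns(rewards, gamma=0.99):
    """Return the discounted return R_t for every step of a finished episode.

    rewards: list of per-step rewards r_0 ... r_T from a single episode.
    gamma:   discount factor (assumed value, not taken from the paper).
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each return reuses the one after it:
    # R_t = r_t + gamma * R_{t+1}
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```

Because the returns depend on rewards that arrive later in the episode, a sketch like this can only run after the episode ends; the resulting values would then be stored alongside each transition in the replay memory.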

### Mixing Update Targets

![](/files/-Lc3tsQPBsr1HTzb1A25)
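
A minimal Python sketch of mixing the two targets, assuming the mixed target is a convex combination of the on-policy Monte Carlo return and the 1-step bootstrapped target, weighted by a parameter `beta` in [0, 1]. The function `mixed_target` and its arguments are hypothetical names. For DDPG, `next_q` would be the target critic's value at the next state under the target actor's action; for DQN, the maximum target Q-value over the discrete actions.

```python
def mixed_target(reward, next_q, mc_return, beta, gamma=0.99, terminal=False):
    """Blend an on-policy Monte Carlo return with a 1-step bootstrapped target.

    reward:    immediate reward r_t.
    next_q:    bootstrapped value estimate for the next state (see lead-in).
    mc_return: discounted Monte Carlo return R_t observed from this step.
    beta:      mixing weight; 0 gives the pure bootstrapped target,
               1 gives the pure Monte Carlo target.
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    # 1-step target: no bootstrapping past a terminal state.
    q_target = reward if terminal else reward + gamma * next_q
    return beta * mc_return + (1.0 - beta) * q_target
```

For example, `mixed_target(1.0, 2.0, 4.0, beta=0.5, gamma=0.5)` blends a bootstrapped target of 2.0 with a Monte Carlo return of 4.0 and yields 3.0.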

## Experiments

### Results in discrete action space

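The DQN architecture \[8] uses a deep neural network and 1-step Q-Learning updates to estimate a Q-value for each discrete action. Using the Arcade Learning Environment \[2], we evaluate the effect of mixed updates on the Atari games Beam Rider, Breakout, Pong, QBert, and Space Invaders.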

![](/files/-Lc3uH5-P12kzXVAucFp)

### Results: DDPG

Half Field Offense Domain

Reward function

![](/files/-Lc3uWnVtIGNvPKhEhTb)

![](/files/-Lc3uat2ebh4NRB19IzL)

