A Brief Survey of Deep Reinforcement Learning

Contents

    • Abstract
    • 1. Introduction
    • 2. Reward-Driven Behavior
      • 2.1. Markov Decision Processes
      • 2.2. Challenges in RL
    • 3. RL Algorithms
      • 3.1. Value Functions
      • 3.2. Dynamic Programming
      • 3.3. Sampling
      • 3.4. Policy Search
        • 3.4.1. Policy Gradients
        • 3.4.2. Actor-Critic Methods
      • 3.5. Planning and Learning
      • 3.6. The Rise of DRL
    • 4. Value Functions
      • 4.1. Function Approximation and the DQN
      • 4.2. Q-Function Modifications
    • 5. Policy Search
      • 5.1. Backpropagation Through Stochastic Functions
      • 5.2. Compounding Errors
      • 5.3. Actor-Critic Methods
    • 6. Current Research and Challenges
      • 6.1. Model-Based RL
      • 6.2. Exploration Versus Exploitation
      • 6.3. Hierarchical RL
      • 6.4. Imitation Learning and Inverse RL
      • 6.5. Multiagent RL
    • 7. Conclusion: Beyond Pattern Recognition
    • 8. Acknowledgments

Abstract

Deep reinforcement learning (DRL) is poised to revolutionize the field of artificial intelligence (AI) and represents a step toward building autonomous systems with a higher-level understanding of the visual world. Currently, deep learning is enabling reinforcement learning (RL) to scale to problems that were previously intractable, such as learning to play video games directly from pixels. DRL algorithms have also been applied to robotics, allowing control policies for robots to be learned directly from camera inputs in the real world. In this survey, we begin with an introduction to the general field of RL, then progress to the main streams of value-based and policy-based methods. Our survey covers central algorithms in DRL, including the deep Q-network (DQN), trust region policy optimization (TRPO), and asynchronous advantage actor-critic (A3C). In parallel, we highlight the unique advantages of deep neural networks, focusing on visual understanding via RL. To conclude, we describe several current areas of research within the field.

1. Introduction

One of the primary goals of the field of artificial intelligence (AI) is to produce fully autonomous agents that interact with their environments to learn optimal behaviours, improving over time through trial and error. Crafting AI systems that are responsive and can effectively learn has been a long-standing challenge, ranging from robots, which can sense and react to the world around them, to purely software-based agents, which can interact with natural language and multimedia. A principled mathematical framework for experience-driven autonomous learning is reinforcement learning (RL) [135]. Although RL had some successes in the past [141, 129, 62, 93], previous approaches lacked scalability and were inherently limited to fairly low-dimensional problems. These limitations exist because RL algorithms share the same complexity issues as other algorithms: memory complexity, computational complexity, and, in the case of machine learning algorithms, sample complexity [133]. The rise of deep learning in recent years, relying on the powerful function approximation and representation learning properties of deep neural networks, has provided new tools for overcoming these problems.

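To make this framework concrete before the formal treatment in Section 2, the standard discounted objective can be stated in one line. The sketch below uses the conventional Markov decision process notation rather than anything specific to the algorithms surveyed later:

```latex
% At each step t the agent observes a state s_t, samples an action a_t ~ \pi(\cdot \mid s_t),
% receives a reward r_t, and the environment transitions to the next state s_{t+1}.
% The goal is a policy \pi^* that maximizes the expected discounted return:
\pi^{*} \;=\; \operatorname*{arg\,max}_{\pi} \;
  \mathbb{E}_{\pi}\!\left[\, \sum_{t=0}^{\infty} \gamma^{t}\, r_{t} \right],
  \qquad 0 \le \gamma < 1 .
```

Here \gamma is the discount factor that trades off immediate against future reward; Section 2.1 develops this setting formally as a Markov decision process.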

The advent of deep learning has had a significant impact on many areas in machine learning, dramatically improving the state of the art in tasks such as object detection, speech recognition, and language translation [39]. The most important property of deep learning is that deep neural networks can automatically find compact low-dimensional representations (features) of high-dimensional data (e.g., images, text, and audio). Through crafting inductive biases into neural network architectures, particularly that of hierarchical representations, machine-learning practitioners have made effective progress in addressing the curse of dimensionality [7]. Deep learning has similarly accelerated progress in RL, with the use of deep-learning algorithms within RL defining the field of DRL. The aim of this survey is to cover both seminal and recent developments in DRL, conveying the innovative ways in which neural networks can be used to bring us closer toward developing autonomous agents. For a more comprehensive survey of recent efforts in DRL, we refer readers to the overview by Li [43].

Deep learning enables RL to scale to decision-making problems that were previously intractable, i.e., settings with high-dimensional state and action spaces. Among recent work in the field of DRL, there have been two outstanding success stories. The first, kick-starting the revolution in DRL, was the development of an algorithm that could learn to play a range of Atari 2600 video games at a superhuman level, directly from image pixels [47]. Providing solutions for the instability of function approximation techniques in RL, this work was the first to convincingly demonstrate that RL agents could be trained on raw, high-dimensional observations, solely based on a reward signal. The second standout success was the development of a hybrid DRL system, AlphaGo, which defeated a human world champion in Go [73], paralleling the historic achievement of IBM’s Deep Blue in chess two decades earlier [9]. Unlike the handcrafted rules that have dominated chess-playing systems, AlphaGo comprised neural networks that were trained using supervised learning and RL, in combination with a traditional heuristic search algorithm.

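As a rough illustration of how that stability is typically achieved, the sketch below shows the two standard DQN ingredients, experience replay and a separate, periodically synchronized target network, in PyTorch-style code. It is a minimal, hypothetical sketch rather than the original system of [47]: the layer sizes, the 4-dimensional toy state, the 2-action output, and the Adam optimizer are placeholder assumptions (the original DQN used a convolutional network over Atari frames).

```python
import random
from collections import deque

import torch
import torch.nn as nn

# Toy Q-network and its target copy: state (4 floats) -> Q-values for 2 actions.
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net.load_state_dict(q_net.state_dict())   # start the target as an exact copy

optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=10_000)   # (s, a, r, s2, done) tuples; states as plain lists of floats
gamma = 0.99                    # discount factor

def dqn_update(batch_size: int = 32) -> None:
    """One gradient step on a random minibatch drawn from the replay buffer."""
    if len(replay) < batch_size:
        return
    s, a, r, s2, done = map(torch.as_tensor, zip(*random.sample(replay, batch_size)))
    s, s2, r, done = s.float(), s2.float(), r.float(), done.float()

    q = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)      # Q(s, a) for taken actions
    with torch.no_grad():                                         # target uses the frozen network
        target = r + gamma * (1.0 - done) * target_net(s2).max(dim=1).values
    loss = nn.functional.smooth_l1_loss(q, target)                # Huber loss on the TD error

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Every few thousand environment steps the target is resynchronized:
#     target_net.load_state_dict(q_net.state_dict())
```

Sampling minibatches from the replay buffer decorrelates consecutive observations, and bootstrapping against a frozen target network keeps the regression target from chasing itself; these are the stabilizing measures alluded to above and covered in detail in Section 4.1.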

DRL algorithms have already been applied to a wide range of problems, such as robotics, where control policies for robots can now be learned directly from camera inputs in the real world [41], [42], succeeding controllers that used to be hand-engineered or learned from low-dimensional features of the robot’s state. In Figure 1, we showcase just some of the domains that DRL has been applied to, ranging from playing video games [47] to indoor navigation [100].
