Recent progress of deep reinforcement learning: from AlphaGo to AlphaGo Zero

ZHAO Dong-bin; TANG Zhen-tao; SHAO Kun; ZHU Yuan-heng

Cite this article: ZHAO Dong-bin, TANG Zhen-tao, SHAO Kun, ZHU Yuan-heng. Recent progress of deep reinforcement learning: from AlphaGo to AlphaGo Zero[J]. Control Theory & Applications, 2017, 34(12): 1529-1546.

Received: 2017-11-06; Revised: 2018-01-19

DOI: 10.7641/CTA.2017.70808

2017,34(12):1529-1546

Keywords: deep reinforcement learning; AlphaGo Zero; deep learning; reinforcement learning; artificial intelligence

Funding: National Natural Science Foundation of China (61603382, 61573353, 61533017)

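Author	Affiliation	E-mail
ZHAO Dong-bin^*	Institute of Automation, Chinese Academy of Sciences	dongbin.zhao@ia.ac.cn
TANG Zhen-tao	Institute of Automation, Chinese Academy of Sciences
SHAO Kun	Institute of Automation, Chinese Academy of Sciences
ZHU Yuan-heng	Institute of Automation, Chinese Academy of Sciences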

Abstract

In early 2016, AlphaGo's victory over Lee Sedol became a milestone of artificial intelligence. Since then, deep reinforcement learning (DRL), the core technique of AlphaGo, has received widespread attention and study, and has produced fruitful results in both theory and applications. Subsequently, AlphaGo Zero was developed with a more concise algorithmic form: it masters the game of Go through self-play alone, without any human knowledge, and decisively defeats AlphaGo, once again reshaping our understanding of DRL. DRL combines the advantages of deep learning and reinforcement learning, so it can perform end-to-end perception and decision-making in complex, high-dimensional state-action spaces. In this paper, we survey the progress made by DRL from AlphaGo to AlphaGo Zero. We first review the main algorithms that have contributed to the success of DRL, including the deep Q-network (DQN), A3C, policy-gradient methods, and other algorithms together with their extensions. We then give a detailed introduction to and discussion of AlphaGo Zero, and analyze its impact on the advancement of artificial intelligence. We also present the progress of DRL applications in areas such as games, robotics, natural language processing, intelligent driving, and intelligent health care, as well as related resources. Finally, we discuss the future development of DRL and its implications for artificial intelligence in other potential areas.