Cite this article: LIU Zhi-bin, ZENG Xiao-qin, XU Yan, YU Ji-guo. Learning to control by neural networks using eligibility traces[J]. Control Theory and Technology, 2015, 32(7): 887-894.
Learning to control by neural networks using eligibility traces
Received: 2014-04-27  Revised: 2015-04-10
DOI: 10.7641/CTA.2015.40367
Keywords: reinforcement learning; neural networks; eligibility traces; cart-pole system; gradient descent
Funding: Supported by the National Natural Science Foundation of China (61403205, 61373027, 60117089) and the Laboratory Open Fund of Qufu Normal University (sk201415).
Authors, affiliations, and e-mail:
LIU Zhi-bin* — School of Information Science and Engineering, Qufu Normal University; lzbxian@163.com
ZENG Xiao-qin — College of Computer and Information, Hohai University
XU Yan — College of Information Science and Technology, Nanjing Agricultural University
YU Ji-guo — School of Information Science and Engineering, Qufu Normal University
Abstract (translated from Chinese)
      Reinforcement learning is an important method for solving adaptive problems and is widely applied to learning control in continuous state spaces; however, it suffers from low efficiency and slow convergence. Building on back-propagation (BP) neural networks and combining them with eligibility traces, we propose an algorithm that realizes multi-step updates in the reinforcement learning process. It solves the problem of back-propagating the local gradient of the output layer to the hidden-layer nodes, thereby enabling fast updates of the hidden-layer weights, and a complete algorithm description is given. We further propose an improved residual method that linearly weights the weight updates of each layer during network training, obtaining both the learning speed of the gradient-descent method and the convergence of the residual-gradient method; applying it to the hidden-layer weight updates improves the convergence of the value function. The algorithm is verified and analyzed in a simulated cart-pole balancing experiment. The results show that, after a short period of learning, the method successfully controls the inverted pendulum and significantly improves learning efficiency.
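The multi-step update the abstract describes — eligibility traces combined with BP back-propagation of the output layer's local gradient to the hidden layer — can be illustrated by the following minimal sketch. This is not the authors' exact algorithm: the network size, activation, and hyperparameters (`alpha`, `gamma`, `lam`) are illustrative assumptions, showing only how a per-weight trace lets one TD error update weights credited over many past steps.

```python
import numpy as np

# Sketch: TD(lambda) value estimation with a one-hidden-layer BP network.
# Every weight keeps its own eligibility trace, so a single TD error
# performs a multi-step update over recently visited states.
rng = np.random.default_rng(0)
n_in, n_hid = 4, 8                       # e.g. a 4-dimensional cart-pole state
W1 = rng.normal(0, 0.1, (n_hid, n_in))   # input -> hidden weights
W2 = rng.normal(0, 0.1, (1, n_hid))      # hidden -> output weights
e1, e2 = np.zeros_like(W1), np.zeros_like(W2)  # eligibility traces

alpha, gamma, lam = 0.1, 0.95, 0.7       # illustrative hyperparameters

def value_and_grads(s):
    h = np.tanh(W1 @ s)                  # hidden activations
    v = (W2 @ h).item()                  # scalar state value V(s)
    gW2 = h[None, :]                     # dV/dW2
    delta_h = W2.ravel() * (1 - h**2)    # local gradient propagated to hidden layer
    gW1 = np.outer(delta_h, s)           # dV/dW1 via plain backprop
    return v, gW1, gW2

def td_step(s, r, s_next, done):
    """One transition: decay traces, add fresh gradients, apply TD error."""
    global e1, e2, W1, W2
    v, gW1, gW2 = value_and_grads(s)
    v_next = 0.0 if done else value_and_grads(s_next)[0]
    td_err = r + gamma * v_next - v      # one-step TD error
    e1 = gamma * lam * e1 + gW1          # accumulate eligibility traces
    e2 = gamma * lam * e2 + gW2
    W1 += alpha * td_err * e1            # one error updates all traced weights
    W2 += alpha * td_err * e2
    return td_err
```

With `lam = 0`, this degenerates to the ordinary one-step TD update; larger `lam` spreads each error backward over more of the trajectory, which is the source of the speed-up the abstract claims.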
Abstract
      Reinforcement learning is an important approach to adaptive learning control in continuous state spaces; however, it is hampered by low learning efficiency and slow convergence. To eliminate these deficiencies, we combine back-propagation (BP) neural networks with eligibility traces and propose a fully described learning algorithm that achieves multi-step updates in the reinforcement learning process. The algorithm back-propagates the local gradient from the output-layer nodes to the hidden-layer nodes, thereby rapidly adjusting the hidden-layer weights. During network training, a modified residual method linearly combines the weight updates of each layer, achieving the fast learning rate of the direct-gradient method together with the desired convergence of the residual-gradient method; applying it to the hidden-layer weights improves the convergence of the value function. A cart-pole system is adopted to test the algorithms. Simulation results show that they successfully control the cart-pole balancing system after a short period of learning and significantly improve learning efficiency.
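The "modified residual method" in the abstract linearly combines two update directions. A hedged sketch of that idea, in the spirit of Baird-style residual algorithms (the function name, weighting factor `phi`, and defaults are illustrative assumptions, not the paper's exact formulation):

```python
import numpy as np

# Sketch: linearly weight the direct-gradient and residual-gradient
# update directions by phi in [0, 1], trading the speed of the direct
# method against the convergence guarantees of the residual method.
def residual_update(w, grad_v_s, grad_v_s_next, td_err,
                    alpha=0.05, gamma=0.95, phi=0.3):
    """Return updated weights w after one transition.

    grad_v_s      : gradient of V(s)  w.r.t. w
    grad_v_s_next : gradient of V(s') w.r.t. w
    phi = 0 gives the fast direct (TD) gradient;
    phi = 1 gives the convergent residual gradient (descent on td_err**2).
    """
    direct = td_err * grad_v_s                              # direct-gradient step
    residual = td_err * (grad_v_s - gamma * grad_v_s_next)  # residual-gradient step
    return w + alpha * ((1 - phi) * direct + phi * residual)
```

The abstract applies such a combination specifically to the hidden-layer weights; here a single weight vector stands in for either layer.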