Cite this article: TANG Hao, ZHOU Lei, YUAN Ji-bin. Unified NDP method based on TD(0) learning for both average and discounted Markov decision processes [J]. Control Theory and Technology, 2006, 23(2): 292-296.
Unified NDP method based on TD(0) learning for both average and discounted Markov decision processes
Received: 2004-09-25   Revised: 2005-07-14
Key words: Markov decision processes; performance potentials; TD(0) learning; neuro-dynamic programming
Foundation: supported by the National Natural Science Foundation of China (60404009), the Natural Science Foundation of Anhui Province (050420303), and the Young and Middle-aged Scientific and Technological Innovation Group Program of Hefei University of Technology.
Author affiliation: TANG Hao, ZHOU Lei, YUAN Ji-bin (School of Computer and Information, Hefei University of Technology, Hefei, Anhui 230009, China)
Abstract:
    Motivated by the needs of practical large-scale Markov systems, this paper studies simulation-based learning optimization for Markov decision processes (MDPs). Starting from the definition of performance potentials, a unified temporal-difference formula is established for both the average and the discounted performance criteria. A neural network is used to represent the estimated potentials, and the parameterized TD(0) learning formula and algorithm are derived for approximate policy evaluation. Based on the approximate potentials, approximate policy iteration then yields a neuro-dynamic programming (NDP) optimization method that is unified across the two criteria. The results also extend to semi-Markov decision processes. A numerical example illustrates that the proposed neuro-policy iteration algorithm applies under both criteria, and verifies that the average-criterion problem is the limiting case of the discounted one as the discount factor tends to zero.
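This page carries only the abstract, not the paper's formulas, so the following Python sketch merely illustrates the kind of scheme described: a neural approximator of the performance potentials trained by a parameterized TD(0) rule shared by both criteria. The unified temporal difference used below, delta = r - eta + (1 - alpha) * g(next) - g(current), with eta a running average of the rewards, is an assumed parameterization chosen so that alpha = 0 gives the average criterion and alpha -> 0 reproduces the limiting behaviour the abstract mentions; the paper's actual formula, network, and step sizes may differ, and the three-state chain and all numbers here are hypothetical.

```python
# Minimal sketch (assumptions flagged above): unified TD(0) learning of
# performance potentials with a small neural approximator.
import numpy as np

rng = np.random.default_rng(0)

# A hypothetical fixed-policy Markov chain: P[i, j] is the transition
# probability from state i to j, r[i] the one-step reward in state i.
P = np.array([[0.2, 0.5, 0.3],
              [0.1, 0.6, 0.3],
              [0.4, 0.3, 0.3]])
r = np.array([1.0, 0.0, 2.0])
n = len(r)

class PotentialNet:
    """One-hidden-layer network g(x; w) approximating the potentials."""
    def __init__(self, n_states, n_hidden=8):
        self.n = n_states
        self.W1 = rng.normal(0.0, 0.3, (n_hidden, n_states))
        self.b1 = np.zeros(n_hidden)
        self.w2 = rng.normal(0.0, 0.3, n_hidden)

    def value_and_grads(self, s):
        x = np.eye(self.n)[s]            # one-hot state encoding
        h = np.tanh(self.W1 @ x + self.b1)
        g = self.w2 @ h
        dh = self.w2 * (1.0 - h**2)      # manual backprop for dg/dparams
        return g, (np.outer(dh, x), dh, h)

    def update(self, step, delta, grads):
        dW1, db1, dw2 = grads            # semi-gradient TD(0) parameter update
        self.W1 += step * delta * dW1
        self.b1 += step * delta * db1
        self.w2 += step * delta * dw2

def td0_potentials(alpha, n_steps=100_000, lr=0.01, eta_lr=0.001):
    """Unified TD(0): alpha = 0 is the average criterion, alpha > 0 discounted."""
    net = PotentialNet(n)
    eta = 0.0                            # running estimate of the average reward
    s = 0
    for _ in range(n_steps):
        s_next = rng.choice(n, p=P[s])
        g_s, grads = net.value_and_grads(s)
        g_next, _ = net.value_and_grads(s_next)
        # Unified temporal difference (assumed form, see the lead-in).
        delta = r[s] - eta + (1.0 - alpha) * g_next - g_s
        net.update(lr, delta, grads)
        eta += eta_lr * (r[s] - eta)     # stochastic average of the rewards
        s = s_next
    g = np.array([net.value_and_grads(i)[0] for i in range(n)])
    return g - g.mean(), eta             # normalize: potentials are only
                                         # meaningful up to an additive constant

for a in (0.2, 0.05, 0.0):               # alpha -> 0 approaches the average case
    g, eta = td0_potentials(a)
    print(f"alpha={a:4.2f}  eta~{eta:5.3f}  potentials~{np.round(g, 2)}")
```

Running the loop with decreasing alpha should show the normalized potentials of the discounted cases approaching those of the average case, the limiting behaviour that the paper's numerical example is said to verify. Subtracting eta in the discounted case only shifts the potentials by a constant, so greedy policy improvement over them is unaffected; this is what lets one update rule serve both criteria in the sketch.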