Cite this article: ZHANG Jun-yu, WU Yi-ting, XIA Li, CAO Xi-Ren. Average cost Markov decision processes with countable state spaces [J]. Control Theory & Applications, 2021, 38(11): 1707–1716.
Average cost Markov decision processes with countable state spaces
Received: 2021-08-20    Revised: 2021-11-16
DOI: 10.7641/CTA.2021.10763
2021, 38(11): 1707–1716
Keywords: Markov decision process; long-run average; countable state spaces; Dynkin's formula; Poisson equation; performance sensitivity
Funding: Supported by the National Natural Science Foundation of China (61673019, 61773411, 11931018, 62073346), the Guangdong Province Key Laboratory of Computational Science at Sun Yat-sen University (2020B1212060032), and the Guangdong Basic and Applied Basic Research Foundation (2021A1515010057, 2021A1515011984).
Author	Affiliation	Postcode
ZHANG Jun-yu	Sun Yat-sen University	510275
WU Yi-ting	Sun Yat-sen University
XIA Li	Sun Yat-sen University	510275
CAO Xi-Ren*	The Hong Kong University of Science and Technology
Abstract
      Under the long-run average criterion, a Markov decision process (MDP) with a countable state space may not admit an optimal (stationary) policy. In this paper, we study optimal policies satisfying the optimality inequalities of countable-state MDPs under the long-run average criterion. Different from the vanishing discount approach, we use the discrete Dynkin's formula to derive the main results of this paper. We first present the Poisson equation of an ergodic Markov chain and two instructive examples of null-recurrent Markov chains, and prove the existence of optimal policies satisfying two optimality inequalities with opposite directions. Then, using two comparison lemmas and the performance difference formula, we prove the existence of optimal policies for positive recurrent chains and multichains, and further extend the results to other cases. In particular, several application examples are provided to illustrate the essential performance-sensitivity nature of the long-run average criterion. Our results supplement the theory of optimality inequalities for average-cost MDPs with countable state spaces.
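For reference, the Poisson equation and the performance difference formula mentioned in the abstract can be sketched in their standard forms for an ergodic chain (notation here is the conventional one and may differ from the paper's):

```latex
% Poisson equation for an ergodic Markov chain with transition
% probabilities p(j|i), cost function c, long-run average cost \eta,
% and potential (bias) function g:
g(i) + \eta = c(i) + \sum_{j \in \mathcal{S}} p(j \mid i)\, g(j),
  \qquad i \in \mathcal{S},
% or in matrix form, with all-ones vector \mathbf{1}:
(I - P)\, g + \eta \mathbf{1} = c.

% Performance difference formula comparing two stationary policies
% d and d', with stationary distribution \pi^{d'} of the chain under d':
\eta^{d'} - \eta^{d}
  = \pi^{d'} \bigl[ (c^{d'} + P^{d'} g^{d}) - (c^{d} + P^{d} g^{d}) \bigr].
```

The difference formula is the basis of sensitivity-based optimization: the sign of the bracketed term, weighted by the stationary distribution of the compared policy, determines which policy has the smaller average cost.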