INFORMS Journal on Computing
HOME HELP FEEDBACK SUBSCRIPTIONS ARCHIVE SEARCH TABLE OF CONTENTS
 QUICK SEARCH:   [advanced]


     


INFORMS JOURNAL ON COMPUTING
Vol. 19, No. 2, Spring 2007, pp. 161-174
DOI: 10.1287/ijoc.1050.0155
This Article
Right arrow Full Text (PDF)
Right arrow References
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Alert me to new issues of the journal
Right arrow Download to citation manager
Right arrow reprints & permissions
Citing Articles
Right arrow Citing Articles via Google Scholar
Google Scholar
Right arrow Articles by Hu, J.
Right arrow Articles by Marcus, S. I.
Right arrow Search for Related Content

An Evolutionary Random Policy Search Algorithm for Solving Markov Decision Processes

Jiaqiao Hu, Michael C. Fu, Vahid R. Ramezani, Steven I. Marcus

Department of Applied Mathematics and Statistics, State University of New York at Stony Brook, Stony Brook, New York 11794, USA
Robert H. Smith School of Business and Institute for Systems Research, University of Maryland, College Park, Maryland 20742, USA
Institute for Systems Research, University of Maryland, College Park, Maryland 20742, USA
Department of Electrical and Computer Engineering, and Institute for Systems Research, University of Maryland, College Park, Maryland 20742, USA

jqhu{at}ams.sunysb.edu
mfu{at}rhsmith.umd.edu
rvahid{at}isr.umd.edu
marcus{at}umd.edu

This paper presents a new randomized search method called evolutionary random policy search (ERPS) for solving infinite-horizon discounted-cost Markov-decision-process (MDP) problems. The algorithm is particularly targeted at problems with large or uncountable action spaces. ERPS approaches a given MDP by iteratively dividing it into a sequence of smaller, random, sub-MDP problems based on information obtained from random sampling of the entire action space and local search. Each sub-MDP is then solved approximately by using a variant of the standard policy-improvement technique, where an elite policy is obtained. We show that the sequence of elite policies converges to an optimal policy with probability one. Some numerical studies are carried out to illustrate the algorithm and compare it with existing procedures.

Key words: dynamic programming; Markov; finite state; analysis of algorithms; programming; nonlinear; queues
History: received April 2004; revised June 2005; accepted June 2005.







HOME HELP FEEDBACK SUBSCRIPTIONS ARCHIVE SEARCH TABLE OF CONTENTS
Copyright © 2007 by INFORMS.