Markov decision processes (MDPs) are widely used to model stochastic systems and to deduce optimal decision-making policies. Because the transition probabilities of an MDP are usually unknown, simulation-based policy improvement (SBPI), which improves upon a base policy through simulation, has been suggested. However, estimating the Q-value of each action to determine the best action in each state requires many simulations, which makes SBPI inefficient. In this study, we propose a method that improves the overall efficiency of SBPI by applying optimal computing budget allocation (OCBA) to accumulated samples. Previous works have mainly focused on improving SBPI efficiency for a single state, without reusing earlier simulation samples. In contrast, the proposed method improves the overall efficiency until an optimal policy is found by exploiting the state-traversal property of SBPI. The proposed method accumulates simulation samples across states to estimate the unknown transition probabilities. These probabilities are then used to estimate the mean and variance of the Q-value for each action, which allows OCBA to allocate the simulation budget efficiently when identifying the best action in each state. As SBPI traverses the states, the accumulated samples enable increasingly accurate OCBA allocations; thus, the optimal policy can be obtained with a lower budget. Experimental results demonstrate the improved efficiency of the proposed method compared to previous works.
- Markov decision process
- optimal computing budget allocation
- simulation-based policy improvement
- stochastic system optimization
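To make the budget-allocation step concrete, the following is a minimal sketch of the classic OCBA allocation ratios (Chen et al.) applied to estimated Q-value means and standard deviations for the actions in one state. The function name, inputs, and the use of NumPy are illustrative assumptions, not the paper's actual implementation; the paper's method additionally accumulates samples across states to form these estimates.

```python
import numpy as np

def ocba_allocation(q_means, q_stds, total_budget):
    """Split a simulation budget across actions using the classic
    OCBA ratios: each non-best action i receives budget proportional
    to (sigma_i / delta_i)^2, where delta_i is its optimality gap,
    and the best action b receives sigma_b * sqrt(sum_i (N_i/sigma_i)^2).
    (Illustrative sketch; not the paper's exact procedure.)"""
    q_means = np.asarray(q_means, dtype=float)
    q_stds = np.asarray(q_stds, dtype=float)
    b = int(np.argmax(q_means))          # action currently estimated best
    delta = q_means[b] - q_means         # optimality gaps (zero for b)
    others = np.arange(len(q_means)) != b
    ratios = np.zeros_like(q_means)
    ratios[others] = (q_stds[others] / delta[others]) ** 2
    ratios[b] = q_stds[b] * np.sqrt(np.sum((ratios[others] / q_stds[others]) ** 2))
    ratios /= ratios.sum()               # normalize to budget fractions
    return np.rint(ratios * total_budget).astype(int)

# Example: three actions; the closer a competitor's mean is to the best,
# the more simulation budget it receives to resolve the comparison.
alloc = ocba_allocation([1.0, 0.5, 0.2], [0.3, 0.3, 0.3], total_budget=100)
```

In SBPI, an allocation like this would be recomputed in each visited state as accumulated samples refine the Q-value mean and variance estimates, concentrating simulations on the actions that are hardest to distinguish from the best.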