Operations Research and Management Science ›› 2025, Vol. 34 ›› Issue (10): 66-72. DOI: 10.12005/orms.2025.0310

• Theoretical Analysis and Methodological Study •

  • Corresponding author: HU Zhihua (1977-), male, from Changsha, Hunan; PhD, professor, doctoral supervisor. Research interests: port, shipping and logistics operations optimization; computational intelligence and computational experiments. Email: zhhu@shmtu.edu.cn.
  • First author: WANG Wenjie (2000-), male, from Datong, Shanxi; master's student. Research interest: container terminal operations optimization.
  • Funding: National Key Research and Development Program of China (2023YFE0113200)

Optimization Algorithms for Solving Restricted Container Relocation Problem Based on Q-learning

WANG Wenjie, HU Zhihua, TIAN Xidan   

  1. Logistics Research Center, Shanghai Maritime University, Shanghai 201306, China
  • Received:2023-11-01 Online:2025-10-25 Published:2026-02-27

Abstract: The operating efficiency of the container yard strongly affects the overall performance of a port. To generate relocation plans quickly and improve relocation efficiency in the yard, a model-free heuristic Q-learning algorithm based on the ε-greedy strategy is proposed. Building on a Markov decision process model of the restricted container relocation problem with distinct priorities, a feature extraction scheme for the bay layout of the yard is designed to reduce the algorithm's environment state space; a reward mechanism for the agent's actions is determined, and a heuristic action selection rule is introduced to strengthen the algorithm's search capability. Simulation experiments show that, on instances with 36 to 85 containers, the improvement degree in relocation steps of the heuristic Q-learning algorithm is 1% to 9% relative to the optimal relocation policies obtained by mathematical programming and the branch-and-bound algorithm; on instances with 36 to 64 containers, solution time is reduced by 36% to 65%; moreover, on instances with 18, 50, 64, and 85 containers, the generalization measure ranges from 0.97 to 1.00, demonstrating good generalization performance.
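The abstract does not specify the bay-layout feature extraction scheme. As a hedged illustration only, one simple way to compress a bay into a compact state might look like the sketch below; the `encode_state` name and the per-stack (height, minimum priority) features are assumptions, not the paper's actual design.

```python
def encode_state(bay):
    """Compress a bay (a list of stacks, each a list of container
    priorities with the top of the stack last) into a small hashable state.

    Assumed feature per stack: (height, smallest priority in the stack);
    sorting the features makes the encoding invariant to stack order,
    which shrinks the state space seen by a tabular learner.
    """
    return tuple(sorted((len(s), min(s) if s else 0) for s in bay))

# Two bays that differ only by a permutation of stacks map to one state.
a = encode_state([[1, 3], [2], []])
b = encode_state([[2], [], [1, 3]])
print(a == b)  # True
```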


Abstract: The container relocation problem in container terminal yards greatly affects the efficiency of terminal operations. Inefficient relocation strategies prolong customers' waiting times and increase operating costs in container yards. Moreover, to improve yard management performance, relocation schedules must be generated in real time, and the continuously increasing throughput of container terminals challenges both the speed and the quality of the relocation strategy generation procedure. This paper focuses on the restricted container relocation problem with distinct priorities (RCRP). In the restricted variant, when the container to be retrieved next is buried, only the containers stacked above it may be relocated; each move takes the topmost container of that stack and places it on another stack. The objective is to minimize the total number of relocation steps.
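The restricted move rule above can be sketched in a few lines. This is a minimal illustration under assumed conventions (a bay as a list of stacks, priority 1 retrieved first, names like `legal_actions` invented here), not the paper's implementation.

```python
def target_stack(bay):
    """Return (index of stack holding the next container to retrieve,
    that container's priority)."""
    nxt = min(c for s in bay for c in s)
    return next(i for i, s in enumerate(bay) if nxt in s), nxt

def legal_actions(bay, max_height):
    """Restricted rule: while the target container is buried, only the
    top container of the target stack may be relocated, to any other
    stack that is not full."""
    i, nxt = target_stack(bay)
    if bay[i][-1] == nxt:          # target on top: retrieve it, no relocation
        return []
    return [(i, j) for j in range(len(bay))
            if j != i and len(bay[j]) < max_height]

def step(bay, action):
    """Apply one relocation (from_stack, to_stack), mutating the bay."""
    i, j = action
    bay[j].append(bay[i].pop())

# Example: container 1 is buried under 3 in stack 0, so 3 must move.
bay = [[1, 3], [2], []]
print(legal_actions(bay, max_height=3))  # [(0, 1), (0, 2)]
```

Minimizing the number of such relocations over the whole retrieval sequence is what makes the problem hard, since each move reshapes the bay for all later retrievals.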
To address this problem, a model-free heuristic Q-learning algorithm is devised based on the ε-greedy strategy. First, the RCRP is formulated as a Markov decision process, defining a state space, an action space, and a state transition process suitable for reinforcement learning methods. A method is developed to extract state features from the bay layout so as to reduce the original state space and avoid introducing too many dimensions, and a reward mechanism is tailored to the problem. An environment and an agent are then constructed for the model-free heuristic Q-learning method, in which the Q-table is updated by off-policy temporal-difference learning. Action selection strategies are established for each phase to enhance the training performance and decision-making capacity of the agent: during learning, an ε-greedy strategy balances exploration and convergence; during prediction, heuristic rules prevent the agent from making random choices under insufficient information, thereby improving its ability to obtain optimal solutions. An ablation experiment, a performance experiment, and a generalization experiment are conducted to evaluate the effectiveness, solution performance, and generalization capability of the algorithm under eight evaluation criteria.
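The two-phase action selection and the off-policy temporal-difference update described above can be sketched as follows. The hyperparameter values, the dictionary Q-table, and the `heuristic_action` placeholder are illustrative assumptions; the paper's own heuristic rule and parameters are not reproduced here.

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1      # assumed hyperparameters
Q = defaultdict(float)                      # Q[(state, action)] -> value

def heuristic_action(state, actions):
    """Placeholder for the paper's heuristic rule (not specified here)."""
    return actions[0]

def choose_action(state, actions, training):
    if training and random.random() < EPSILON:
        return random.choice(actions)       # exploration (epsilon-greedy)
    best = max(actions, key=lambda a: Q[(state, a)])
    if not training and Q[(state, best)] == 0.0:
        # Prediction with no learned signal for this state: fall back to
        # a heuristic rule instead of an arbitrary tie-break.
        return heuristic_action(state, actions)
    return best

def td_update(state, action, reward, next_state, next_actions):
    # Off-policy TD target: bootstrap from the best next action,
    # regardless of the action the behavior policy actually takes.
    target = reward + GAMMA * max(
        (Q[(next_state, a)] for a in next_actions), default=0.0)
    Q[(state, action)] += ALPHA * (target - Q[(state, action)])
```

In a training loop, `choose_action(..., training=True)` drives exploration while `td_update` refines the Q-table after every relocation; at prediction time, `training=False` switches the agent to greedy selection with the heuristic fallback.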
Numerical experiments on 280 randomly generated test cases of different sizes show the following. For the 40 instances of dimension 8×7, the average prediction time of the Q-learning algorithm with heuristic rules is 1.67 times that of the Q-learning algorithm without them, while the relocation strategies it generates use only 34% of the steps required by the algorithm without heuristic rules; there is no significant difference in the training time of the two algorithms. Within this acceptable increase in solution time, the Q-learning algorithm with heuristic rules produces significantly higher-quality solutions than the one without, so incorporating heuristic rules markedly improves solution quality. The heuristic Q-learning algorithm and the branch-and-bound algorithm solve instances of various sizes with almost the same number of relocation steps (similarity between 0.91 and 0.99), and the quality of solutions is either equal or slightly superior to that of branch-and-bound. In terms of computing time, except for very small and very large instances, the efficiency difference rate relative to the branch-and-bound algorithm is approximately 36% to 86%. Agent training is long for large-scale problems, ranging from 113 to 320 CPU seconds, which is acceptable in practice. For instances with 18, 50, 64, and 85 containers, the generalization metric of the average number of relocation steps obtained by the branch-and-bound and heuristic Q-learning algorithms ranges from 0.97 to 1.00, with no statistically significant difference between the results of the two algorithms. The proposed Q-learning algorithm thus exhibits good generalization performance.
As a reinforcement learning method, the devised algorithm achieves satisfactory results when solving problems of a given dimension, but instances of different dimensions require training separate agents, and more iterations during agent training are essential for better prediction performance. In future research, it is worthwhile to design more suitable environment state extraction methods to make the environment state more concise and improve the computational performance of the algorithm, and to refine the action selection strategies used during agent training and prediction to enhance the agent's learning effectiveness.

Key words: container terminal operations optimization, container relocation problem, Markov decision process, Q-learning
