运筹与管理 (Operations Research and Management Science) ›› 2025, Vol. 34 ›› Issue (8): 185-191. DOI: 10.12005/orms.2025.0260

• Applied Research •

A Reinforcement Learning Approach for Joint Optimization of Continuous Berth Allocation and Quay Crane Scheduling

WANG Ling, WANG Yu, LIANG Chengji

  1. Institute of Logistics Science and Engineering, Shanghai Maritime University, Shanghai 201306, China
  • Received: 2023-02-08  Published: 2025-12-04
  • Corresponding author: WANG Yu (1989-), female, born in Qinhuangdao, Hebei; Ph.D., lecturer; research interests: automotive logistics, logistics optimization algorithms. Email: wangyu@shmtu.edu.cn.
  • Biography: WANG Ling (1994-), female, born in Fei County, Shandong; master's; research interest: port scheduling optimization.
  • Funding:
    Shanghai Sailing Program for Young Scientific and Technological Talents (21YF1416400); Soft Science Research Project of the Shanghai "Science and Technology Innovation Action Plan" (22692111200)


Abstract: In recent years, growing container throughput and increasingly intelligent handling equipment have raised the requirements for joint berth and quay crane scheduling in dynamic port environments. To make full use of the large amount of data available in such environments and thereby make efficient optimization decisions, this paper treats the continuous berth and quay crane scheduling problem as a sequential decision problem, formulates the corresponding Markov decision process, and proposes a deep reinforcement learning algorithm based on Proximal Policy Optimization (PPO). The algorithm fully accounts for the dynamic movement of quay cranes and the dynamic arrival of vessels, with carefully designed state space, action space, and reward function; by continuously interacting with the dynamic environment of large-scale, complex scenarios, it obtains the best joint scheduling plan for continuous berth allocation and quay crane assignment. Test results on multiple instances show that the proposed PPO algorithm adapts well to different problem scales and dynamic environments and outperforms traditional scheduling decision methods: compared with the genetic algorithm and particle swarm optimization, it improves computational efficiency by 93.21% and 93.01% and improves the decision objective by 15.7% and 20.3% on average; compared with the DDPG reinforcement learning algorithm, it converges faster and improves the decision objective (working time) by 6.5%-10% across several groups of instances.
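For reference, the clipped surrogate objective that defines standard PPO (the general form from the reinforcement learning literature; the paper's exact loss terms and hyperparameter values are not given here) can be written in LaTeX as:

L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)},

where \hat{A}_t is the estimated advantage of action a_t in state s_t and \epsilon is the clipping parameter that limits how far the updated policy may move from the old policy in a single update.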

Keywords: berth and quay crane scheduling, deep reinforcement learning, continuous berth allocation, quay crane dynamic scheduling, PPO algorithm

Abstract: Container ports are important maritime hubs that connect trade between countries. In daily operations, a port needs to allocate limited berth and quay crane resources to ships within a planning period based on factors such as arrival time, ship type, workload, and departure plans, so that all ships complete their operations as soon as possible. Berth allocation and quay crane scheduling decisions usually rely on the intuition and experience of port staff, which can easily lead to prolonged vessel stays in port or a waste of the limited resources. Moreover, with the trend toward larger ships, increasing port throughput, and the gradual adoption of intelligent equipment and systems, a large amount of operational data has accumulated during daily operations. More scientific and intelligent decision-making methods are therefore urgently needed for berth and quay crane scheduling to further improve port operation efficiency and resource utilization, especially in complex, large-scale dynamic environments.
Port resource allocation and equipment scheduling have attracted the attention of many scholars, but existing research on berth allocation usually discretizes the quay line into a set of berths, which does not faithfully depict the actual continuous quay. Existing research on quay crane scheduling mostly considers static task sets and therefore fails to meet the real-time scheduling needs of ports in complex environments. Many solutions have been proposed for berth allocation and quay crane scheduling problems, but most of them treat the two related decisions separately, and those that use machine learning techniques only tune the parameters of traditional algorithms without taking full advantage of the adaptivity and efficiency of reinforcement learning.
To meet the need for intelligent berth and quay crane scheduling decisions in large-scale, complex dynamic environments, this paper studies the joint optimization of continuous berth allocation and quay crane dynamic scheduling. The problem is treated as a sequential decision problem and described as a Markov Decision Process (MDP) with carefully designed state space, action space, and reward function. An efficient deep reinforcement learning method is proposed by combining an Advantage Actor-Critic (A2C) network architecture with Proximal Policy Optimization (PPO).
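To make the MDP formulation concrete, the following is a minimal, illustrative sketch of how such an environment could be encoded: the state is a vector of next-vessel and quay features, the action is two-dimensional (normalized berth position along the continuous quay and the share of cranes to assign), and the reward penalizes handling time. The class name, attribute names, and dynamics below are hypothetical simplifications, not the authors' implementation.

# Illustrative sketch only: the state, action, and reward designs below are
# simplified assumptions, not the paper's environment.
from dataclasses import dataclass
from typing import List
import numpy as np


@dataclass
class Vessel:
    arrival: float            # expected arrival time (h)
    length: float             # vessel length (m); occupies a continuous quay segment
    workload: float           # container moves to be handled
    berth_pos: float = -1.0   # assigned start position along the quay (m)
    start: float = -1.0       # handling start time (h)


class BerthQCEnv:
    """Toy continuous-berth and quay crane scheduling environment (MDP sketch)."""

    def __init__(self, vessels: List[Vessel], quay_length: float = 1200.0,
                 n_cranes: int = 6, crane_rate: float = 30.0):
        self.vessels = vessels
        self.quay_length = quay_length
        self.n_cranes = n_cranes
        self.crane_rate = crane_rate   # moves per crane per hour
        self.t = 0                     # index of the next vessel to schedule

    def state(self) -> np.ndarray:
        """State: features of the next unscheduled vessel plus global quay information."""
        v = self.vessels[self.t]
        return np.array([v.arrival, v.length, v.workload,
                         self.quay_length, self.n_cranes], dtype=np.float32)

    def step(self, action: np.ndarray):
        """Action: (normalized berth position, fraction of cranes to assign)."""
        v = self.vessels[self.t]
        v.berth_pos = float(np.clip(action[0], 0, 1)) * (self.quay_length - v.length)
        cranes = max(1, int(round(float(np.clip(action[1], 0, 1)) * self.n_cranes)))
        v.start = v.arrival                      # simplification: waiting and berth overlap ignored
        handling_time = v.workload / (cranes * self.crane_rate)
        reward = -handling_time                  # objective: minimize total working time
        self.t += 1
        done = self.t == len(self.vessels)
        next_state = None if done else self.state()
        return next_state, reward, done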
The PPO-based method is trained and tested on real data from a port in Shanghai, China. Each training episode contains 72 interaction steps with the environment, and the agent updates the model every 100 steps. During training, intermediate results are saved every 1000 episodes, and the process stabilizes and converges after about 60 thousand episodes. On test datasets of different scales, the proposed method can quickly generate optimal decisions for continuous berth allocation and quay crane dynamic scheduling. To evaluate its performance, comparative experiments are conducted against the DDPG algorithm and three conventional methods. Compared with DDPG, the proposed method reduces the total working time of the ships by 3 to 10 hours and converges faster and more stably during training. For 30 ships, compared with the genetic algorithm, the PSO algorithm, and a first-come-first-served heuristic, the proposed method reduces the total-working-time objective by 15.7%, 20.3%, and 11%, respectively, and reduces decision time by about 93% compared with the two metaheuristics.
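As a concrete illustration of the training schedule just described (72 steps per episode, a model update every 100 environment steps, a checkpoint every 1000 episodes, on the order of 60 thousand episodes), the following sketch shows one possible loop. The agent interface and the random stub are placeholders, not the authors' code; any PPO implementation exposing these four methods could be plugged in, together with an environment like the sketch above.

import numpy as np


class RandomAgentStub:
    """Placeholder standing in for a PPO agent (select_action/store/update/save)."""

    def select_action(self, state):
        return np.random.rand(2)     # e.g. (berth position, crane share), both in [0, 1]

    def store(self, *transition):
        pass                         # a real agent would buffer transitions here

    def update(self):
        pass                         # a real agent would run the PPO update here

    def save(self, path):
        pass                         # a real agent would checkpoint its networks here


def train(make_env, agent, n_episodes=60_000, steps_per_episode=72,
          update_every=100, checkpoint_every=1_000):
    total_steps = 0
    for episode in range(1, n_episodes + 1):
        env = make_env()                      # a fresh dynamic scenario each episode
        state = env.state()
        for _ in range(steps_per_episode):    # 72 agent-environment interactions
            action = agent.select_action(state)
            next_state, reward, done = env.step(action)
            agent.store(state, action, reward, done)
            total_steps += 1
            if total_steps % update_every == 0:    # model update every 100 steps
                agent.update()
            if done:
                break
            state = next_state
        if episode % checkpoint_every == 0:        # save stage training results
            agent.save(f"ppo_checkpoint_{episode}.pt")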
The PPO-based method proposed in this paper can fully utilize historical data, learn and update the best policy through interactive training in a dynamic environment, and make effective decisions for continuous berth allocation and quay crane dynamic scheduling. It offers clear advantages in decision-making efficiency and objective values over traditional scheduling decision methods and is thus better aligned with the current intelligent development requirements of ports. Future research could apply reinforcement learning to multi-stage joint equipment scheduling and to cooperative games among multiple agents.

Key words: berth and quay crane joint scheduling, deep reinforcement learning, continuous berth allocation, crane dynamic scheduling, PPO algorithm

CLC Number: