强化学习在运筹学的应用:研究进展与展望

doi:10.12005/orms.2020.0137

摘要/Abstract

摘要： 强化学习已经成为人工智能领域一个新的研究热点,并已成功应用于各领域,强化学习将运筹优化领域的很多问题视为序贯决策问题,建模为马尔可夫决策过程并进行求解,在求解复杂、动态、随机运筹优化问题具有较大的优势。本文主要对强化学习在运筹优化领域的应用进行综述,首先介绍了强化学习的基本原理及其应用于运筹优化领域的研究框架,然后回顾并总结了强化学习在库存控制、路径优化、装箱配载和车间作业调度等方面的研究成果,并将最新的深度强化学习以及传统方法在运筹学领域的应用研究进行了对比分析,以突出深度强化学习的优越性。最后提出几个值得进一步探讨的研究方向,期望能为强化学习在运筹优化领域的研究提供参考。

关键词: 强化学习, 运筹优化, 序贯决策, 马尔可夫决策过程, 深度强化学习

Abstract: Reinforcement learning has become a new research hotspot in the field of artificial intelligence, and has been successfully applied in various fields. Reinforcement learning regards many problems in the community of operational optimization as sequential decision problems, modeled as Markov decision processes, and then solve them. It has great advantages in solving complex, dynamic and random operation optimization problems. This paper mainly summarizes the application of reinforcement learning in the area of operational optimization. Firstly, it introduces the basic principles of reinforcement learning and its application framework in the field of operational optimization. Then it systematically reviews and summarizes the reinforcement learning in inventory control, path optimization, packing and loading and job shop scheduling. And the latest deep reinforcement learning and the application of traditional methods in the field of operations research are compared and analyzed to highlight the superiority of deep reinforcement learning. Finally, several research directions worthy of further discussion are proposed, and it is expected to provide reference for the study of reinforcement learning in the field of operational optimization.

Key words: reinforcement learning, operation and optimization, sequential decision, Markov decision process, deep reinforcement learning

中图分类号:

TP18
O22

徐翔斌, 李志鹏. 强化学习在运筹学的应用:研究进展与展望[J]. 运筹与管理, 2020, 29(5): 227-239.

XU Xiang-bin, LI Zhi-peng. Research Progress and Prospects for Application of Reinforcement Learning in Operations Research[J]. Operations Research and Management Science, 2020, 29(5): 227-239.

参考文献

[1] 高阳,陈世福,陆鑫.强化学习研究综述[J].自动化学报,2004,30(1):86-100.
[2] Kaelbling L P, Littman M L, Moore A W. Reinforcement learning: a survey[J]. J Artificial Intelligence Research, 1996, 4(1): 237-285.
[3] Mnih V, Kavukcuoglu K, Silver D, et al. Playing atari with deep reinforcement learning[J]. arXiv preprint arXiv:13125602, 2013.
[4] Mnih V, Kavukcuoglu K, Silver D, et al. Human-level control through deep reinforcement learning[J]. Nature, 2015, 518(7540): 529-533.
[5] Kober J, Peters J. Reinforcement learning in robotics: a survey[J]. International Journal of Robotics Research, 2013, 32(11): 1238-1274.
[6] Yogatama D, Blunsom P, Dyer C, Grefenstette E, Ling W. Learning to compose words into sentences with reinforcement learning[J]. arXiv preprint arXiv:161109100, 2016.
[7] Li L, Chu W, Langford J, et al. A contextual-bandit approach to personalized news article recommendation[C]//Proceedings of the 19th international conference on World wide web. ACM, 2010: 661-670.
[8] 余凯,贾磊,陈雨强,徐伟.深度学习的昨天、今天和明天[J].计算机研究与发展,2013,50(9):1799-1804.
[9] Silver D, Huang A, Maddison C J, et al. Mastering the game of Go with deep neural networks and tree search[J]. Nature, 2016, 529(7587): 484-489.
[10] Sutton R S, Barto A G. Reinforcement learning: an introduction[M]. MIT press, 2018.
[11] Levine S, Koltun V. Guided policy search[C]//International Conference on Machine Learning. 2013. 1-9.
[12] Deisenroth M, Rasmussen C E. PILCO: A model-based and data-efficient approach to policy search[C]//Proceedings of the 28th International Conference on machine learning (ICML-11). 2011: 465-472.
[13] Watkins C J C H, Dayan P. Q-learning[J]. Machine learning, 1992, 8(3-4): 279-292.
[14] Spall J C. Estimation via markov chain monte carlo[J]. Control Systems IEEE, 2002, 23(2): 34-45.
[15] Sutton R S. Learning to predict by the methods of temporal differences[J]. Machine Learning, 1988, 3(1): 9-44.
[16] Li Y. Deep reinforcement learning: an overview[J]. arXiv preprint arXiv:170107274, 2017.
[17] Arjona-Medina J A, Gillhofer M, Widrich M, et al. RUDDER: return decomposition for delayed rewards[J]. arXiv preprint arXiv:1806.07857, 2018.
[18] Williams R J. Simple statistical gradient-following algorithms for connectionist reinforcement learning[J]. Machine Learning, 1992, 8(3-4): 229-256.
[19] Konda V. Actor-critic algorithms[J]. Siam Journal on Control and Optimization, 2003, 42(4): 1143-1166.
[20] Mnih V, Badia A P, Mirza M, et al. Asynchronous methods for deep reinforcement learning[C]//International conference on machine learning. 2016. 1928-1937.
[21] Gendreau M, Laporte G, Séguin R. An exact algorithm for the vehicle routing problem with stochastic demands and customers[M]. INFORMS, 1995.
[22] 杨华龙,叶迪,张倩,曾庆成.时间窗变动的车辆调度干扰管理模型与算法[J].运筹与管理,2017,(10):56-64.
[23] 郑斐峰,梅启煌,刘明,张小宁.基于遗传算法与贪婪策略的多港口集装箱配载研究[J]. 运筹与管理,2018,27(5):1-7.
[24] 邰世文,商剑平.煤炭码头卸车调度问题多目标优化模型及算法[J].运筹与管理,2018,27(6):91-99.
[25] Puterman M L. Markov decision processes: discrete stochastic dynamic programming[M]. John Wiley & Sons, 2014.
[26] Van Roy B, Bertsekas D P, Lee Y, Tsitsiklis J N. A neuro-dynamic programming approach to retailer inventory management[C]//Decision and Control, 1997., Proceedings of the 36th IEEE Conference on. IEEE, 1997, 4: 4052-4057.
[27] Kim C O, Jun J, Baek J K, Smith R L, Kim Y D. Adaptive inventory control models for supply chain management[J]. The International Journal of Advanced Manufacturing Technology, 2004, 26(9-10): 1184-1192.
[28] Kim C O, Kwon I H, Baek J G. Asynchronous action-reward learning for nonstationary serial supply chain inventory control[J]. Applied Intelligence, 2007, 28(1): 1-16.
[29] Kwon I H, Chang O K, Jin J, Lee J H. Case-based myopic reinforcement learning for satisfying target service level in supply chain[J]. Expert Systems with Applications, 2008, 35(1-2): 389-397.
[30] Jiang C, Sheng Z. Case-based reinforcement learning for dynamic inventory control in a multi-agent supply-chain system[J]. Expert Systems with Applications, 2009, 36(3): 6520-6526.
[31] Giannoccaro I, Pontrandolfo P. Inventory management in supply chains: a reinforcement learning approach[J]. International Journal of Production Economics, 2002, 78(2): 153-161.
[32] Chaharsooghi S K, Heydari J, Zegordi S H. A reinforcement learning model for supply chain ordering management: an application to the beer game[J]. Decision Support Systems, 2008, 45(4): 949-959.
[33] Tongeren T V, Kaymak U, Naso D, Asperen E V. Q-learning in a competitive supply chain[C]//Systems, Man and Cybernetics, 2007. ISIC. IEEE International Conference on. IEEE, 2007. 1211-1216.
[34] Mortazavi A, Arshadi Khamseh A, Azimi P. Designing of an intelligent self-adaptive model for supply chain ordering management system[J]. Engineering Applications of Artificial Intelligence, 2015, 37: 207-220.
[35] Oroojlooyjadid A, Nazari M, Snyder L, Taká M. A deep q-network for the beer game with partial information[J]. arXiv preprint arXiv:170805924, 2017.
[36] 王皓,高阳,陈兴国.强化学习中的迁移:方法和进展[J].电子学报,2008,36(s1):39-43.
[37] Gijsbrechts J, Boute R N, Van Mieghem J A, Zhang D. Can deep reinforcement learning improve inventory management? performance and implementation of dual sourcing-mode problems[J]. Performance and Implementation of Dual Sourcing-Mode Problems(December 17, 2018), 2018.
[38] Valluri A, North M J, Macal C M. Reinforcement learning in supply chains[J]. International Journal of Neural Systems, 2009, 19(05): 331-344.
[39] Sun R, Zhao G, Yin C. A multi-agent coordination of a supply chain ordering management with multiple members using reinforcement learning[C]//Industrial Informatics(INDIN), 2010 8th IEEE International Conference on. IEEE, 2010: 612-616.
[40] Kim C O, Kwon I-H, Kwak C. Multi-agent based distributed inventory control model[J]. Expert Systems with Applications, 2010, 37(7): 5186-5191.
[41] Kwak C, Choi J S, Kim C O, Kwon I H. Situation reactive approach to vendor managed inventory problem[J]. Expert Systems with Applications, 2009, 36(5): 9039-9045.
[42] Yang S, Zhang J. Adaptive inventory control and bullwhip effect analysis for supply chains with non-stationary demand[C]//Chinese Contwol and Deciaion Conlesence, 2015. 3903-3908.
[43] Sui Z, Gosavi A, Lin L. A reinforcement learning approach for inventory replenishment in vendor-managed inventory systems with consignment inventory[J]. Engineering Management Journal, 2010, 22(4): 44-53.
[44] Katanyukul T, Duff W S, Chong E K P. Approximate dynamic programming for an inventory problem: empirical comparison[J]. Computers & Industrial Engineering, 2011, 60(4): 719-743.
[45] Rummery G A, Niranjan M. On-line Q-learning using connectionist systems[M]. Cambridge, England: University of Cambridge, Department of Engineering, 1994.
[46] Bertsekas D P, Tsitsiklis J N. Neuro-dynamic programming: an overview[C]//Proceedings of the 34th IEEE Conference on Decision and Control. Piscataway, NJ: IEEE Publ., 1995, 1: 560-564.
[47] Kara A, Dogan I. Reinforcement learning approaches for specifying ordering policies of perishable inventory systems[J]. Expert Systems with Applications, 2018, 91: 150-158.
[48] Dantzig G B, Ramser J H. The truck dispatching problem[J]. Management Science, 1959, 6(1): 80-91.
[49] Garey M R, Johnson D S. Computers and intractability: a guide to the theory of NP-completeness[M]. W. H. Freeman, 1986.
[50] Psaraftis H N. A dynamic programming solution to the single vehicle many-to-many immediate request dial-a-ride problem[J]. Transportation Science, 1980, 14(2): 130-154.
[51] Secomandi N. Analysis of a rollout approach to sequencing problems with stochastic routing applications[J]. Journal of Heuristics, 2003, 9(4): 321-352.
[52] Toriello A, Haskell W B, Poremba M. A dynamic traveling salesman problem with stochastic arc costs[J]. Operations Research, 2014, (62): 1107-1125.
[53] Bello I, Pham H, Le Q V, Norouzi M, Bengio S. Neural combinatorial optimization with reinforcement learning[J]. arXiv preprint arXiv:161109940, 2016.
[54] Vinyals O, Fortunato M, Jaitly N. Pointer networks[C]//Advances in Neural Information Processing Systems, 2015: 2692-2700.
[55] Dai H, Khalil E B, Zhang Y, Dilkina B, Song L. Learning combinatorial optimization algorithms over graphs[J]. arXiv preprint arXiv: 170401665, 2017.
[56] Riedmiller M. Neural fitted Q iteration-first experiences with a data efficient neural reinforcement learning method[C]//European Conference on Machine Learning. Springer, Berlin, Heidelberg, 2005. 317-328.
[57] Bengio Y, Lodi A, Prouvost A. Machine learning for combinatorial optimization: a methodological tour d’horizon[J]. arXiv preprint arXiv:181106128, 2018.
[58] Emami P, Ranka S. Learning permutations with sinkhorn policy gradient[J]. arXiv preprint arXiv:180507010, 2018.
[59] Mena G, Belanger D, Linderman S, Snoek J. Learning latent permutations with gumbel-sinkhorn networks[J]. arXiv preprint arXiv:1802.08665, 2018.
[60] Kool W, Welling M. Attention solves your TSP[J]. arXiv preprint arXiv:180308475, 2018.
[61] Deudon M, Cournut P, Lacoste A, Adulyasak Y, Rousseau L M. Learning heuristics for the TSP by policy gradient[C]//International Conference on the Integration of Constraint Programming, Artificial Intelligence, and Operations Research. Springer, Cham, 2018.170-181.
[62] Laporte G, Louveaux F V, Hamme L V. An integer l-shaped algorithm for the capacitated vehicle routing problem with stochastic demands[J]. Operations Research, 2002, 50(3): 415-423.
[63] Dror M, Laporte G, Trudeau P. Vehicle routing with stochastic demands: properties and solution frameworks[J]. Transportation Science, 1989, 23(3): 166-176.
[64] Secomandi N. Comparing neuro-dynamic programming algorithms for the vehicle routing problem with stochastic demands[J]. Computers & Operations Research, 2000, 27(11-12): 1201-1225.
[65] Novoa C, Storer R. An approximate dynamic programming approach for the vehicle routing problem with stochastic demands[J]. European Journal of Operational Research, 2009, 196(2): 509-515.
[66] Goodson J C, Thomas B W, Ohlmann J W. Restocking-based rollout policies for the vehicle routing problem with stochastic demand and duration limits[J]. Transportation Science, 2015, 50(2): 591-607.
[67] Bertazzi L, Secomandi N. Faster rollout search for the vehicle routing problem with stochastic demands and restocking[J]. European Journal of Operational Research, 2018, 270(2): 487-497.
[68] Tan Y. Automated scheduling: reinforcement learning approach to algorithm policy learning[C]//Advances in Artificial Intelligence: 31st Canadian Conference on Artificial Intelligence, Canadian AI 2018, Toronto, ON, Canada, May 8–11, 2018, Proceedings 31. Springer International Publishing, 2018. 335-338.
[69] Fan J, Wang X, Ning H. A multiple vehicles routing problem algorithm with stochastic demand[C]//Intelligent Control and Automation, 2006. WCICA 2006. The Sixth World Congress on. IEEE, 2006, 1: 1688-1692.
[70] 娄山佐,吴耀华,肖际伟,廖莉.基于增强学习解决随机需求车辆路径问题[J].系统仿真学报,2008,20(14):3675-3678.
[71] Zhang C C, Dellaert N N, Zhao L L, Woensel V T T, Sever D D. Single vehicle routing with stochastic demands: approximate dynamic programming[J]. Dept. Ind. Eng., Tsinghua Univ., Beijing, China, Tech. Rep, 2013. 1-37.
[72] Nazari M, Oroojlooy A, Snyder L V, Taká M. Deep reinforcement learning for solving the vehicle routing problem[J]. arXiv preprint arXiv:180204240, 2018.
[73] Meisel S, Suppa U, Mattfeld D. Serving multiple urban areas with stochastic customer requests[M]. Springer Berlin Heidelberg, 2011.
[74] Ulmer M W, Mattfeld D C, Hennig M, Goodson J C. A rollout algorithm for vehicle routing with stochastic customer requests[M]. Springer International Publishing, 2016.
[75] Ulmer M W, Goodson J C, Mattfeld D C, Hennig M. Offline-online approximate dynamic programming for dynamic vehicle routing with stochastic requests[J]. Transportation Science, 2018.
[76] Ulmer M W, Mattfeld D C, Soeffker N. Dynamic multi-period vehicle routing: approximate value iteration based on dynamic lookup tables[J]. Technical Report, Technische University; t Braunschweig, 2016. 1-28.
[77] Nasrollahzadeh A A, Khademi A, Mayorga M E. Real-time ambulance dispatching and relocation[J]. Manufacturing & Service Operations Management, 2018.
[78] Maxwell M S, Restrepo M, Henderson S G, Topaloglu H. Approximate dynamic programming for ambulance redeployment[J]. INFORMS Journal on Computing, 2010, 22(2): 266-281.
[79] Kellerer H, Pferschy U, Pisinger D. Knapsack problems[M]. Springer Berlin Heidelberg, 2004.
[80] Garey M R, Johnson D S. Approximation algorithms for bin packing problems: a survey[M]. Springer Vienna, 1981.
[81] Kang J, Park S. Algorithms for the variable sized bin packing problem[J]. European Journal of Operational Research, 2003, 147(2): 365-372.
[82] Christensen H I, Khan A, Pokutta S, Tetali P. Approximation and online algorithms for multidimensional bin packing: a survey[J]. Computer Science Review, 2017, 24: 63-79.
[83] Kleywegt A J, Papastavrou J D. The dynamic and stochastic knapsack problem[M]. INFORMS, 1998.
[84] Kleywegt A J, Papastavrou J D. The dynamic and stochastic knapsack problem with random sized items[J]. Operations Research, 2001, 49(1): 26-41.
[85] Mastin A, Jaillet P. Average-case performance of rollout algorithms for knapsack problems[J]. Journal of Optimization Theory & Applications, 2014, 165(3): 1-21.
[86] Kong W, Liaw C, Mehta A, Sivakumar D. A new dog learns old tricks: RL finds classic optimization algorithms[C]//Intemational Conlewence on Leawning Retwesentations, 2019: 1-25.
[87] Bertsimas D, Demir R. An approximate dynamic programming approach to multidimensional knapsack problems[J]. Management Science, 2002, 48(4): 550-565.
[88] Perry T C, Hartman J C. An approximate dynamic programming approach to solving a dynamic, stochastic multiple knapsack problem[J]. International Transactions in Operational Research, 2009, 16(3): 347-359.
[89] Hua Z, Zhang B, Liang L. An approximate dynamic programming approach to convex quadratic knapsack problems[J]. Computers & Operations Research, 2006, 33(3): 660-673.
[90] Goodson J C, Thomas B W, Ohlmann J W. A rollout algorithm framework for heuristic solutions to finite-horizon stochastic dynamic programs[J]. European Journal of Operational Research, 2017, 258(1): 216-229.
[91] Hu H, Zhang X, Yan X, Wang L, Xu Y. Solving a new 3d bin packing problem with deep reinforcement learning method[J]. arXiv preprint arXiv:170805930, 2017.
[92] Hu H, Duan L, Zhang X, Xu Y, Wei J. A multi-task selected learning approach for solving new type 3D bin packing problem[J]. arXiv preprint arXiv:180406896, 2018.
[93] Laterre A, Fu Y, Jabri M K, et al. Ranked reward: enabling self-play reinforcement learning for combinatorial optimization[J]. arXiv preprint arXiv:180701672, 2018.
[94] Browne C B, Powley E, Whitehouse D, et al. A survey of monte carlo tree search methods[J]. IEEE Transactions on Computational Intelligence and AI in games, 2012, 4(1): 1-43.
[95] Zhang W, Dietterich T G. A reinforcement learning approach to job-shop scheduling[C]//IJCAI. 1995, 95: 1114-1120.
[96] Riedmiller S, Riedmiller M. A neural reinforcement learning approach to learn local dispatching policies in production scheduling[C]//IJCAI. 1999, 2: 764-771.
[97] Aydin M E, Ztemel E. Dynamic job-shop scheduling using reinforcement learning agents[J]. Robotics & Autonomous Systems, 2000, 33(2): 169-178.
[98] 魏英姿,赵明扬.一种基于强化学习的作业车间动态调度方法[J].自动化学报,2005,31(5):765-771.
[99] Chen X, Hao X C, Lin H W, Murata T. Rule driven multi objective dynamic scheduling by data envelopment analysis and reinforcement learning[C]//Automation and Logistics(ICAL), 2010 IEEE International Conference on. IEEE, 2010. 396-401.
[100] 魏英姿,谷侃锋.基于性能预测的遗传强化学习动态调度方法[J].系统仿真学报,2010,22(12):2809-2812.
[101] Shahrabi J, Adibi M A, Mahootchi M. A reinforcement learning approach to parameter estimation in dynamic job shop scheduling[J]. Computers & Industrial Engineering, 2017, 110: 75-82.
[102] 潘燕春,冯允成,周泓,魏佳呈.强化学习和仿真相结合的车间作业排序系统[J].控制与决策,2007,22(6):675-679.
[103] Gabel T, Riedmiller M. Distributed policy search reinforcement learning for job-shop scheduling tasks[J]. International Journal of Production Research, 2012, 50(1): 41-61.
[104] Qu S, Jie W, Govil S, Leckie J O. Optimized adaptive scheduling of a manufacturing process system with multi-skill workforce and multiple machine types: an ontology-based, multi-agent reinforcement learning approach[J]. Procedia Cirp, 2016, 57: 55-60.
[105] Mao H, Alizadeh M, Menache I, Kandula S. Resource management with deep reinforcement learning[C]//Proceedings of the 15th ACM Workshop on Hot Topics in Networks. ACM, 2016. 50-56.
[106] Chen W, Xu Y, Wu X. Deep reinforcement learning for multi-resource multi-machine job scheduling[J]. arXiv preprint arXiv: 171107440, 2017.
[107] Waschneck B, Reichstaller A, Belzner L, et al. Optimization of global production scheduling with deep reinforcement learning[J]. Procedia CIRP, 2018, 72: 1264-1269.
[108] Chen X, Tian Y. Learning to progressively plan[J]. arXiv preprint arXiv:181000337, 2018.
[109] Wang Y C, Usher J M. Learning policies for single machine job dispatching[J]. Robotics & Computer Integrated Manufacturing, 2004, 20(6): 553-562.
[110] 王世进,孙晟,周炳海,奚立峰.基于Q-学习的动态单机调度[J].上海交通大学学报,2007,41(8):1227-1232.
[111] Xanthopoulos A S, Koulouriotis D E, Tourassis V D, Emiris D M. Intelligent controllers for bi-objective dynamic scheduling on a single machine with sequence-dependent setups[J]. Applied Soft Computing, 2013, 13(12): 4704-4717.
[112] 王国磊,钟诗胜,林琳.基于聚类状态隶属度的动态调度Q-学习[J].高技术通讯,2009,19(4):428-433.
[113] Clark A J, Scarf H. Optimal policies for a multi-echelon inventory problem[J]. Management Sciece, 1960, 50(12): 1782-1790.
[114] Sterman J D. Modeling managerial behavior: misperceptions of feedback in a dynamic decision making experiment[J]. Management science, 1989, 35(3): 321-339.
[115] Dantzig G B. Discrete-variable extremum problems[J]. Operations Research, 1957, 5(2): 266-277.
[116] Grandl R, Ananthanarayanan G, Kandula S, Rao S, Akella A. Multi-resource packing for cluster schedulers[J]. ACM SIGCOMM Computer Communication Review, 2015, 44(4): 455-466.
[117] Finn C, Abbeel P, Levine S. Model-agnostic meta-learning for fast adaptation of deep networks[J]. arXiv preprint arXiv: 170303400, 2017.