Operations Research and Management Science ›› 2020, Vol. 29 ›› Issue (5): 227-239. DOI: 10.12005/orms.2020.0137
XU Xiang-bin, LI Zhi-peng
Received: 2018-12-27
Online: 2020-05-25
徐翔斌, 李志鹏
Author biographies:
XU Xiang-bin (1975-), male, from Hukou, Jiangxi; professor, Ph.D.; main research interests: logistics and supply chain management. LI Zhi-peng (1995-), male, from Poyang, Jiangxi; master's student; main research interests: reinforcement learning and operations research optimization.
Funding:
CLC Number:
XU Xiang-bin, LI Zhi-peng. Research Progress and Prospects for Application of Reinforcement Learning in Operations Research[J]. Operations Research and Management Science, 2020, 29(5): 227-239.