Operations Research and Management Science, 2026, Vol. 35, Issue (1): 219-225. DOI: 10.12005/orms.2026.0031

• Management Science •

Data- and Knowledge-Driven Learning for Credit Default Prediction

SHAO Yuanhai1, LIU Wenzheng2, SONG Yiwei2

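  1. School of Mathematics and Statistics, Hainan University, Haikou, Hainan 570100, China;
  2. International Business School, Hainan University, Haikou, Hainan 570100, China
  • Received: 2024-04-20 Published: 2026-06-04
  • Corresponding author: SHAO Yuanhai (born 1983), male, from Yili, Xinjiang; professor, Ph.D., doctoral supervisor. Research interests: support vector machines, machine learning, and optimization methods and their applications. Email: shaoyuanhai21@163.com.
  • Funding:
    National Natural Science Foundation of China (12271131)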

Data and Knowledge Driven Ensemble Credit Default Prediction

SHAO Yuanhai1, LIU Wenzheng2, SONG Yiwei2   

  1. School of Mathematics and Statistics, Hainan University, Haikou 570100, China;
  2. International Business School, Hainan University, Haikou 570100, China
  • Received: 2024-04-20 Published: 2026-06-04

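Abstract: With the rapid development of internet technology, the credit default risk brought by the diversification of financial platforms and products poses a challenge to the stability of financial markets. Accurate prediction of credit default is crucial to the financial industry, and machine learning methods are widely used for it. However, traditional prediction methods need large amounts of credit data to drive them, and they overlook both the domain knowledge accumulated by financial platforms and products and the problem of insufficient data. To address this, the paper proposes statistical invariant ensemble learning: predicates are constructed from financial domain expertise, and these domain-knowledge predicates are combined with the data to form statistical invariants. Learning is thus driven by both data and knowledge, which significantly improves the generalization performance of credit default prediction, especially when the amount of data is small. To verify the effectiveness of the proposed method, comparative experiments and statistical tests are carried out on several credit default datasets. The results show that the model improves on all key indicators of credit default prediction and differs significantly from previous models.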
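For orientation, the general LUSI framework ties a predicate to the data by requiring that the learned function reproduce a label statistic on the training sample. The display below is the standard empirical form of such an invariant from the LUSI literature, not a formula taken from this paper. The notation is an assumption: $\ell$ training pairs $(x_i, y_i)$ with $y_i \in \{0, 1\}$, predicates $\psi_1, \dots, \psi_m$, and $f(x)$ an estimate of the default probability. The paper's own predicates and formulation may differ.

```latex
\frac{1}{\ell}\sum_{i=1}^{\ell} \psi_s(x_i)\, f(x_i)
  \;=\;
\frac{1}{\ell}\sum_{i=1}^{\ell} \psi_s(x_i)\, y_i ,
\qquad s = 1, \dots, m
```

With $\psi_s(x) \equiv 1$, for instance, the invariant asks that the average predicted default probability equal the observed default rate.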

Key words: credit default, prediction, ensemble learning, statistical invariant learning

Abstract: Credit default prediction, the assessment of whether a borrower is likely to default, is central to the credit business. Online lending combines a low entry threshold with high risk in investment returns, and platforms often lack sufficient financial data and find it difficult to use domain knowledge to guide prediction. Under these conditions, traditional machine learning methods struggle to achieve high accuracy. Addressing this requires a method that can exploit the knowledge embedded in the existing data to augment model training, so that loan users at risk of default can be identified accurately and efficiently.
This paper proposes a novel credit default prediction method called Ensemble Statistical Invariants Learning (ESIL), which integrates the ideas of Learning Using Statistical Invariants (LUSI) and ensemble learning. ESIL first applies financial domain expertise to construct predicates and form statistical invariants, so that credit default domain knowledge enters the learning problem; under the weak mode of convergence, these invariants accelerate the convergence of the expectation function within the admissible set of functions. It then uses ensemble learning to choose the most suitable domain knowledge: predicates are selected from a predefined set by computing corresponding thresholds, and it can be shown that selecting suitable predicates in this way decreases the objective function, so the model improves during the iterative process. In addition, to preserve the structure of the sampled data, easily classifiable samples are added to the existing imbalanced samples to correct the data distribution. By combining statistical invariants derived from credit domain knowledge with ensemble learning, ESIL makes credit default prediction both data-driven and knowledge-driven, offering a new approach to credit assessment in the era of online platforms, particularly when samples are insufficient.
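The idea of testing predicates against a threshold can be illustrated with a minimal sketch. It is not the ESIL algorithm: the violation measure, the threshold value, the toy data, and the function names `invariant_violations` and `select_predicates` are all assumptions made for illustration. Each predicate is evaluated on the training sample, and a predicate is kept when the current model violates its empirical invariant by more than the threshold.

```python
import numpy as np

def invariant_violations(predicates, y, scores):
    """Return |(1/n) * sum_i psi_s(x_i) * (y_i - f(x_i))| for each predicate s.

    predicates: array of shape (m, n); row s holds psi_s on the n samples
    y:          array of shape (n,), labels in {0, 1}
    scores:     array of shape (n,), model estimates of P(y = 1 | x)
    """
    predicates = np.asarray(predicates, dtype=float)
    residual = np.asarray(y, dtype=float) - np.asarray(scores, dtype=float)
    return np.abs(predicates @ residual) / residual.shape[0]

def select_predicates(predicates, y, scores, threshold):
    """Indices of predicates whose invariant is violated by more than `threshold`."""
    violations = invariant_violations(predicates, y, scores)
    return [int(s) for s in np.flatnonzero(violations > threshold)]

# Toy data (hypothetical): 4 borrowers, one numeric feature, 2 predicates.
income = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1, 1, 0, 0])                 # 1 = default
scores = np.array([0.5, 0.5, 0.5, 0.5])    # an uninformative model

predicates = np.vstack([
    np.ones_like(income),  # psi_0(x) = 1: compares mean score with default rate
    income,                # psi_1(x) = income: compares income-weighted averages
])

violations = invariant_violations(predicates, y, scores)
selected = select_predicates(predicates, y, scores, threshold=0.1)
print(violations, selected)
```

In this toy case the constant predicate is already satisfied (the mean score equals the default rate), while the income predicate is violated, so only the latter would be passed on to constrain the next round of training.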
A series of experiments is then conducted to validate the effectiveness of ESIL on real credit data from UCI, Kaggle, Tianchi and DataFountain. Ten predicates are constructed, based on the data imbalance in credit default problems, customer-specific attributes, anomalous default samples, and some commonly used predicates. The experiments lead to the following conclusions. First, in comparative experiments on three indicators commonly used in credit default prediction, ESIL shows strong performance on AUC, F1-measure and G-mean; on small datasets in particular, the improvement over the benchmark methods can exceed 8%. Second, the Friedman test and the Nemenyi test show that ESIL is significantly superior to the other models in the comparison experiments. Third, the study examines the predictive performance of ESIL and analyzes its properties on credit data. ESIL maintains high prediction accuracy relative to the other models when the dataset is small. Moreover, when multiple statistical invariants are combined, prediction accuracy improves as the number of predicate matrices increases, which supports the claim of faster convergence on small datasets. Finally, taking the Chinese Taiwan dataset as an example, the paper analyzes the interpretability of ESIL in credit default prediction.
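The three indicators named above are standard for imbalanced classification, and a small sketch shows how they are usually computed. These are the textbook definitions, assumed here because the abstract does not define them; the labels and scores are made-up toy values, not data from the paper, and the helper names are hypothetical.

```python
import math

def confusion_counts(y_true, y_pred):
    """Count (tp, fp, tn, fn) for binary labels, with 1 = default."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 1:
            fp += 1
        elif t == 0 and p == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def f1_measure(y_true, y_pred):
    """Harmonic mean of precision and recall on the positive (default) class."""
    tp, fp, _, fn = confusion_counts(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity and specificity."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(sensitivity * specificity)

def auc(y_true, scores):
    """Probability that a random defaulter outscores a random non-defaulter."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy imbalanced sample (hypothetical): 2 defaulters among 8 borrowers.
y_true = [1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.1, 0.2, 0.3, 0.1, 0.2, 0.6]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

f1_value = f1_measure(y_true, y_pred)
g_value = g_mean(y_true, y_pred)
auc_value = auc(y_true, scores)
print(f1_value, g_value, auc_value)
```

Plain accuracy on this sample is 0.75 even though half of the defaulters are missed, which is why metrics that weigh the minority class, such as F1 and G-mean, are preferred when defaults are rare.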
By integrating multiple statistical invariants built from domain knowledge, the proposed ESIL provides new ideas for the theoretical development of credit default prediction. Its improved accuracy also offers new strategies for the practical application of financial risk management. In future work, developing and comparing different ensemble learning techniques should help optimize the performance of ESIL in credit risk assessment: distinct ensemble strategies may suit specific data distributions and problem types, so the most appropriate one can be chosen to strengthen ESIL's predictive ability. It would also be interesting to investigate the applicability of ESIL beyond credit default prediction; experiments on datasets from other domains would test its adaptability and performance across a wider range of scenarios.

Key words: credit default, prediction, ensemble learning, learning using statistical invariants
