Data and Knowledge Driven Ensemble Credit Default Prediction

doi:10.12005/orms.2026.0031

Abstract

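Abstract (translated from Chinese): With the rapid development of internet technology, the credit default risk brought by the diversification of financial platforms and products poses a challenge to the stability of financial markets. Accurate prediction of credit default is crucial to the financial industry, and machine learning methods have been widely used for it. However, traditional prediction methods need large amounts of credit data and overlook both the domain knowledge accumulated by financial platforms and products and the problem of insufficient data. To address this, this paper proposes statistical invariant ensemble learning, which uses financial domain expertise to construct predicates and combines these domain-knowledge predicates with data to form statistical invariants. This achieves data and knowledge dual-driven learning and thereby significantly improves the generalization performance of credit default prediction, especially when the amount of data is small. To verify the effectiveness of the proposed method, comparative experiments and statistical tests are conducted on several credit default datasets. The results show that the model improves on all key indicators of credit default prediction and differs significantly from previous models.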

Keywords: credit default, prediction, ensemble learning, statistical invariant learning

Abstract: Credit default prediction, which is central to the credit business, assesses whether a borrower is likely to default. However, because online lending combines a low entry threshold with high risk in investment returns, platforms often lack financial data and find it difficult to use domain knowledge to guide prediction, which makes it hard for traditional machine learning methods to achieve high accuracy. To tackle this challenge, a method is needed that can exploit the knowledge embedded in the existing data to strengthen model training and so help identify, accurately and efficiently, loan users who are at risk of default.
This paper proposes a novel credit default prediction method called Ensemble Statistical Invariants Learning (ESIL), which integrates the ideas of Learning Using Statistical Invariants (LUSI) and ensemble learning. It first applies financial domain expertise to construct predicates and form statistical invariants, so that credit default domain knowledge is used; this accelerates the convergence of the expectation function in the admissible set of functions through the weak mode of convergence. Furthermore, it leverages ensemble learning to choose the most suitable domain knowledge: it can be shown that selecting suitable predicates from a predefined set by computing the corresponding thresholds decreases the objective function. This strategy ensures that ESIL effectively optimizes the model during the iterative process. In addition, to maintain the integrity of the sampled data structure, easily classifiable samples are added to the existing imbalanced samples to correct the data distribution. By combining statistical invariants obtained from the credit domain with ensemble learning, ESIL achieves data and knowledge dual-driven learning for credit default prediction, offering a new approach to credit assessment in the era of online platforms, particularly in scenarios with insufficient samples.
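The role of a predicate can be illustrated with a minimal sketch. It assumes the standard LUSI form of a statistical invariant, in which a predicate ψ requires the empirical mean of ψ(x)·f(x) to match the empirical mean of ψ(x)·y. The toy borrowers, the three predicates, and the function name `invariant_gap` are hypothetical and are not taken from the paper; choosing the predicate with the largest gap is only one plausible selection heuristic, not necessarily the threshold rule ESIL uses.

```python
# Hypothetical illustration of checking statistical invariants; all names and
# data below are assumptions made for this sketch, not the paper's setup.

def invariant_gap(predicate, xs, ys, probs):
    """Return |mean(psi(x) * f(x)) - mean(psi(x) * y)| for one predicate."""
    n = len(xs)
    lhs = sum(predicate(x) * p for x, p in zip(xs, probs)) / n  # model side
    rhs = sum(predicate(x) * y for x, y in zip(xs, ys)) / n     # data side
    return abs(lhs - rhs)

# Toy borrowers: (debt-to-income ratio, number of past late payments)
xs = [(0.2, 0), (0.8, 3), (0.5, 1), (0.9, 4)]
ys = [0, 1, 0, 1]               # 1 = default
probs = [0.1, 0.6, 0.4, 0.7]    # current model's predicted default probabilities

# Hypothetical domain-knowledge predicates
predicates = {
    "constant": lambda x: 1.0,
    "high_debt_ratio": lambda x: 1.0 if x[0] > 0.6 else 0.0,
    "late_payments": lambda x: float(x[1]),
}

gaps = {name: invariant_gap(psi, xs, ys, probs) for name, psi in predicates.items()}

# The predicate whose invariant is violated most is a candidate to enforce next.
worst = max(gaps, key=gaps.get)
print(gaps, worst)
```

A gap near zero means the current model already respects that piece of domain knowledge, while a large gap marks knowledge the model has not yet absorbed.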
Subsequently, a series of experiments is conducted to validate the effectiveness of ESIL using real credit data from UCI, Kaggle, Tianchi and DataFountain. Ten predicates are constructed according to the data imbalance in credit default problems, customer-specific attributes, anomalous default samples and some commonly used predicates. The following conclusions are drawn from the experiments. First, comparative experiments on three indicators commonly used in credit default prediction show that the ESIL model performs well on the AUC, F1-measure and G-mean indicators. On small datasets in particular, the improvement can exceed 8% compared to the benchmark methods. Second, the Friedman and Nemenyi tests show that ESIL is significantly superior to the other models in the comparison experiments. Third, this study explores the prediction performance of the ESIL model and analyzes its relevant properties on credit data. The results reveal that the ESIL model still maintains high prediction accuracy compared to other models when the data size is small. Meanwhile, an analysis of the accuracy gains from combining various statistical invariants finds that prediction accuracy improves as the number of predicate matrices increases, and demonstrates faster convergence on small datasets. Finally, taking the Chinese Taiwan dataset as an example, the paper analyzes the interpretability of ESIL in credit default prediction.
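For readers unfamiliar with the imbalance-aware indicators named above, the sketch below computes the F1-measure and G-mean from a confusion matrix. It assumes the usual definitions (F1 as the harmonic mean of precision and recall, G-mean as the square root of sensitivity times specificity); the paper's exact definitions may differ, and the labels are made-up toy data.

```python
import math

def f1_and_gmean(y_true, y_pred):
    """Compute F1 and G-mean for binary labels, where 1 marks a default."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else 0.0

    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    gmean = math.sqrt(recall * specificity)
    return f1, gmean

# Toy imbalanced sample: 2 defaulters among 10 borrowers (made-up labels).
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]

f1, gmean = f1_and_gmean(y_true, y_pred)
print(f1, gmean)
```

In this toy case plain accuracy is 80% even though only one of the two defaulters is caught, which is why indicators such as F1 and G-mean are preferred for imbalanced credit data.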
The proposed ESIL provides new ideas for the theoretical development of credit default prediction by integrating multiple statistical invariants that draw on domain knowledge. The results show that ESIL improves accuracy and offers new strategies for the practical application of financial risk management. In the future, developing different ensemble learning techniques and comparing various ensemble methods will help optimize the performance of ESIL in credit risk assessment. Distinct ensemble strategies may suit specific data distributions and problem types, so the most appropriate ensemble approach can be chosen to enhance the predictive performance of ESIL. It is also of interest to investigate the applicability of ESIL beyond credit default prediction. Extending ESIL to other domains through experiments on various datasets would test its adaptability and performance across a wide range of scenarios.

Keywords: credit default, prediction, ensemble learning, learning using statistical invariants

CLC number:

F272

SHAO Yuanhai, LIU Wenzheng, SONG Yiwei. Data and Knowledge Driven Ensemble Credit Default Prediction[J]. Operations Research and Management Science (运筹与管理), 2026, 35(1): 219-225.
