High-quality Data Resource Identification of Network Trading Platform Based on K-medoids-NCA-SMOTE-BSVM Model

doi:10.12005/orms.2023.0357

Abstract

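Summary: As data services continue to diversify, demand for trading and circulating data resources, an emerging factor of production, has grown explosively. Identifying high-quality data resources in massive data and unlocking their value as a production factor has become key for data trading platforms seeking competitive advantage and more efficient factor allocation. This paper aims to discover the key factors behind the formation of high-quality data in the platform trading context and to propose a method for efficiently identifying high-quality data among large-scale, heterogeneous data resources. First, based on the formation process of high-quality data, a two-dimensional “intrinsic quality-commodity characterization” identification index system is constructed. Next, a K-medoids-NCA-SMOTE-BSVM fusion model is proposed to classify and predict data of high, medium, and low quality. Finally, API transaction data are collected from a real data trading platform for an empirical study. The results show that, compared with SVM, WOA-SVM, PSO-SVM, MLP, and CNN, the K-medoids-NCA-SMOTE-BSVM model performs well in both prediction accuracy and training time. The proposed identification indicators and classification model provide a basis for judging and predicting data quality in the platform economy, and have practical value for setting data quality standards from a product perspective and for optimizing data transaction pricing.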

Key words: data trading platform, high-quality data, K-medoids-NCA-SMOTE-BSVM, multi-model ensemble

Abstract: Data resources are an emerging factor of production, and demand for trading and circulating them has grown explosively. As data volumes grow exponentially, data quality has become a widespread concern, and large amounts of low-quality data are flooding into data trading platforms of all kinds. Identifying high-quality data resources among massive resources has therefore become key for data trading platforms to gain competitive advantage and improve the efficiency of factor allocation. Existing research provides a basis for identifying high-quality data in the platform trading context, but two deficiencies remain. First, existing identification methods apply only to quality evaluation of small-scale, homogeneous data resources; they rely heavily on manual work and are insufficiently automated, so they struggle to meet the needs of quality identification for large-scale data resources. Second, existing methods ignore the uneven distribution of data resources across quality levels, which easily biases classification results and makes it hard to meet the robustness requirements of heterogeneous sample classification. This paper aims to clarify the mechanism by which high-quality data resources form in the context of platform transactions, to discover the key factors in that formation, and to propose a method for efficiently identifying high-quality data among large-scale, heterogeneous data resources.
Circulation and trading are necessary for realizing the value of data resources, and in a platform economy data circulate in an open market with numerous participants. This paper studies the flow of data resources and the process by which high-quality data are generated in the platform environment, and builds an “intrinsic quality-commodity characterization” index system for identifying high-quality data resources. It then proposes a K-medoids-NCA-SMOTE-BSVM fusion model that treats high-quality data identification as a pattern recognition problem and solves it with supervised machine learning. The model has four main parts: (1) The numbers of views, favorites, and downloads of each data resource are combined as the basis for discrimination; K-medoids clusters the data resource samples to generate classification labels automatically, and the optimal number of labels is determined with the silhouette coefficient. (2) Because the chosen indicators may include some that matter little for classifying high-quality data resources and could therefore degrade model performance, Neighborhood Components Analysis (NCA) reduces the dimensionality of the constructed indicators to yield a new feature set. (3) After clustering, the number of samples varies widely across classes, so the Synthetic Minority Over-sampling Technique (SMOTE) is applied under the new feature set to enlarge the minority classes and balance the data between classes. (4) A nonlinear identification model for high-quality data resources is built on a Bayesian-optimized support vector machine (BSVM); trained on the balanced dataset, it takes the dimension-reduced identification indicators as input and the cluster-derived category labels as output, and classifies data resources as high, medium, or low quality.
Finally, an empirical study is carried out on the API datasets of a real data trading platform: Python is used to crawl each data resource's request parameters, return parameters, update frequency, data sources, data descriptions, labels, application scenarios, specifications, registered capital of service merchants, views, downloads, and favorites.
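Part (1) of the model, generating quality labels by clustering and choosing the number of labels with the silhouette coefficient, can be illustrated with a minimal sketch. It is not the authors' implementation: the twelve data points are hypothetical standardized (views, favorites, downloads) scores, the function names are illustrative, and the deterministic farthest-first seeding is an assumption made here so the example is reproducible.

```python
import math


def init_medoids(points, k):
    """Deterministic seeding (an assumption): most central point, then farthest-first."""
    n = len(points)
    first = min(range(n), key=lambda i: sum(math.dist(points[i], q) for q in points))
    medoids = [first]
    while len(medoids) < k:
        medoids.append(max(
            range(n),
            key=lambda i: min(math.dist(points[i], points[m]) for m in medoids),
        ))
    return medoids


def assign(points, medoids):
    """Label each point with the index of its nearest medoid."""
    return [
        min(range(len(medoids)), key=lambda c: math.dist(p, points[medoids[c]]))
        for p in points
    ]


def kmedoids(points, k, max_iter=100):
    """Alternate between assigning points and re-picking each cluster's medoid."""
    medoids = init_medoids(points, k)
    for _ in range(max_iter):
        labels = assign(points, medoids)
        updated = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:  # keep the old medoid if a cluster ends up empty
                updated.append(medoids[c])
                continue
            # New medoid: the member with the smallest total distance to the others.
            updated.append(min(
                members,
                key=lambda i: sum(math.dist(points[i], points[j]) for j in members),
            ))
        if updated == medoids:
            break
        medoids = updated
    return assign(points, medoids), medoids


def silhouette(points, labels):
    """Mean silhouette coefficient over all points (needs at least two clusters)."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    scores = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        a = sum(math.dist(points[i], points[j]) for j in own) / (len(own) - 1)
        b = min(
            sum(math.dist(points[i], points[j]) for j in idx) / len(idx)
            for other, idx in clusters.items() if other != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Hypothetical standardized (views, favorites, downloads) scores for 12 resources.
points = [
    (1.0, 1.2, 0.9), (1.1, 0.8, 1.0), (0.9, 1.0, 1.1), (1.2, 1.1, 1.0),
    (5.0, 5.1, 4.9), (5.2, 4.8, 5.0), (4.9, 5.0, 5.2), (5.1, 5.2, 5.0),
    (9.0, 9.1, 8.9), (9.2, 8.8, 9.0), (8.9, 9.0, 9.2), (9.1, 9.2, 9.0),
]
scores = {k: silhouette(points, kmedoids(points, k)[0]) for k in range(2, 6)}
best_k = max(scores, key=scores.get)
labels, medoids = kmedoids(points, best_k)
print(best_k, labels)
```

The cluster indices in `labels` are arbitrary; mapping them to "high", "medium", and "low" would require ranking the clusters, for example by the engagement scores of their medoids.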
The results show that: (a) comparing unbalanced with balanced datasets, SMOTE balancing improves data resource quality identification and raises the classification accuracy of the optimized model; (b) on both imbalanced and balanced datasets, BSVM outperforms SVM, WOA-SVM, PSO-SVM, MLP, and CNN in prediction accuracy, and it needs less training time than the other optimization algorithms. In summary, this paper clarifies the meaning of high-quality data resources, builds an index system for identifying them, and verifies that system with trading platform data, making it an innovative attempt at, and an addition to, the theory of data resource quality assessment. It also builds a high-quality data resource identification model that can generate quality labels for massive data resources and improve recognition accuracy for heterogeneous data resources, offering guidance for encouraging active trading of data resources.
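The balancing step credited in result (a) relies on SMOTE, which creates synthetic minority samples by interpolating between a minority sample and one of its nearest minority-class neighbors. The sketch below shows that core idea only. The function name `smote`, its parameters, and the toy feature vectors are hypothetical, and the source does not state which implementation or neighbor count the authors used.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Return n_new synthetic samples interpolated between minority-class neighbors."""
    rng = random.Random(seed)
    n = len(minority)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(n)
        base = minority[i]
        # The k nearest minority samples to the base point, excluding the base itself.
        neighbours = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )[:k]
        target = minority[rng.choice(neighbours)]
        gap = rng.random()  # position along the segment from base to target
        synthetic.append(tuple(b + gap * (t - b) for b, t in zip(base, target)))
    return synthetic


# Toy 2-D feature vectors (hypothetical): 8 majority samples, 3 minority samples.
majority = [(0.1 * i, 0.2 * i) for i in range(8)]
minority = [(5.0, 5.0), (5.5, 4.5), (6.0, 5.2)]

synthetic = smote(minority, n_new=len(majority) - len(minority), k=2)
balanced_minority = minority + synthetic
print(len(majority), len(balanced_minority))
```

Because every synthetic point lies on a segment between two real minority samples, the new samples stay inside the region the minority class already occupies rather than duplicating existing points.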

Key words: data trading platform, high-quality data, K-medoids-NCA-SMOTE-BSVM, multi-model integration

CLC number: F724.6

NI Yuan, LI Siyuan, XU Lei, ZHANG Jian, FANG Jinyu. High-quality Data Resource Identification of Network Trading Platform Based on K-medoids-NCA-SMOTE-BSVM Model[J]. Operations Research and Management Science, 2023, 32(11): 87-93.
