运筹与管理 (Operations Research and Management Science) ›› 2025, Vol. 34 ›› Issue (8): 77-82. DOI: 10.12005/orms.2025.0244

• Theoretical Analysis and Method Discussion •

Research on Text Feature Selection Method Based on Fixed Initial Population Genetic Algorithm

WANG Zhaogang   

  1. School of Finance, Shandong University of Finance and Economics, Jinan 250014, Shandong, China
  • Received: 2024-07-04 Published: 2025-12-04
  • About the author: WANG Zhaogang (1988-), male, from Feicheng, Shandong; Ph.D., lecturer; research interests: data mining and knowledge visualization. Email: 20205553@sdufe.edu.cn.
  • Funding:
    Natural Science Foundation of Shandong Province (ZR2021MG046); General Program of the National Natural Science Foundation of China (72473080)

Abstract: Research on text feature selection based on the genetic algorithm (GA) has largely overlooked the adverse effect that the randomness of the initial population has on feature selection. This paper therefore proposes CHI_FIPGA, a text feature selection method that combines the chi-square test (CHI) with a fixed initial population GA. The initial population of the GA is set to the feature words with the highest CHI scores, and diversity among individuals in the initial population is maintained by varying the number of feature words each individual selects. With the classification accuracy of a classification model as the fitness, the algorithm iteratively searches for the optimum over the full set of feature words through genetic operations such as selection, crossover, and mutation. Chinese text classification experimental datasets are selected, and the optimal solutions of CHI_FIPGA are compared with those of GA, CHI_GA, PSO, and CHI_PSO under different classification models, including multilayer perceptron neural networks, random forests, naive Bayes, K-nearest neighbors, and decision trees. The experimental results show that, relative to GA, CHI_GA, PSO, and CHI_PSO, the optimal solutions of CHI_FIPGA achieve higher classification accuracy with fewer feature words, and its advantage is especially pronounced on datasets with many categories.


Abstract: In the context of big data, selecting text features to reduce feature dimensionality, improve classification accuracy, and cut the time cost of the classification process is a problem that text classification tasks inevitably face and must solve. Existing text feature selection methods often use the genetic algorithm (GA) and other optimization algorithms to recast text feature selection as a combinatorial feature optimization problem with classification accuracy as the objective.
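As a concrete illustration of the CHI filtering idea, the chi-square statistic of a single feature word with respect to one class can be computed from its 2×2 contingency table. This is a generic sketch, not code from the paper; the function name and the document counts are hypothetical.

```python
def chi2_word(n11, n10, n01, n00):
    """Chi-square statistic for one feature word vs. one class.
    n11: in-class docs containing the word, n10: out-of-class docs
    containing it, n01/n00: the corresponding counts without the word."""
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    return num / den if den else 0.0

# A word in 40 of 50 in-class docs but only 10 of 150 out-of-class docs
# scores high, i.e. it is strongly class-indicative.
print(chi2_word(40, 10, 10, 140))   # ≈ 107.56
# A word distributed independently of the class scores 0.
print(chi2_word(25, 75, 25, 75))    # 0.0
```

Ranking the vocabulary by this statistic (against each class, or its maximum over classes) yields the ordering from which the fixed initial population is drawn.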
Research on GA-based text feature selection has achieved significant results in narrowing the search range of feature selection, reducing dependence on classification models, and mitigating the adverse effects of randomness in genetic operations such as selection, crossover, and mutation. However, existing research has largely overlooked the randomness of the GA's initial population, which harms iterative feature optimization: a randomized initial population ignores the contribution of feature words to classification, which to some extent makes it harder for the algorithm to converge quickly near the global optimum.
This article therefore proposes CHI_FIPGA, a text feature selection method that combines the chi-square (CHI) test with a fixed initial population GA. The initial population of the GA is set to the feature words with the highest CHI scores, and the differences between individuals in the initial population are maintained by varying the number of feature words each individual selects. With the classification accuracy of the classification model as the fitness, genetic operations such as selection, crossover, and mutation iteratively search for the optimum over the full set of feature words.
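The procedure above can be sketched as a GA over binary feature masks whose initial population is deterministic. This is a minimal toy illustration under stated assumptions, not the authors' implementation: `chi_scores` stand in for real chi-square statistics, and `fitness` is a cheap surrogate for the classification accuracy of a trained model.

```python
import random

random.seed(0)

N_FEATURES = 20      # toy vocabulary size
POP_SIZE = 10
GENERATIONS = 30

# Hypothetical CHI scores; in the paper they come from the chi-square
# test between each feature word and the class labels.
chi_scores = [random.random() for _ in range(N_FEATURES)]
ranked = sorted(range(N_FEATURES), key=lambda i: chi_scores[i], reverse=True)

def fixed_initial_population():
    """Individual k selects the top-k CHI-ranked words: varying k keeps
    individuals different while the population itself is deterministic."""
    pop = []
    for k in range(2, 2 + POP_SIZE):
        mask = [0] * N_FEATURES
        for i in ranked[:k]:
            mask[i] = 1
        pop.append(mask)
    return pop

def fitness(mask):
    """Surrogate for the classification accuracy of a model trained on
    the selected words: rewards high-CHI words, penalizes large subsets."""
    if not any(mask):
        return 0.0
    score = sum(chi_scores[i] for i in range(N_FEATURES) if mask[i])
    return score / (1 + 0.2 * sum(mask))

def evolve(pop):
    for _ in range(GENERATIONS):
        nxt = [max(pop, key=fitness)]               # elitism
        while len(nxt) < POP_SIZE:
            a = max(random.sample(pop, 3), key=fitness)   # tournament
            b = max(random.sample(pop, 3), key=fitness)   # selection
            cut = random.randrange(1, N_FEATURES)   # one-point crossover
            child = a[:cut] + b[cut:]
            for i in range(N_FEATURES):             # bit-flip mutation
                if random.random() < 0.05:
                    child[i] ^= 1
            nxt.append(child)
        pop = nxt
    return max(pop, key=fitness)

best = evolve(fixed_initial_population())
print("words selected:", sum(best), "fitness:", round(fitness(best), 3))
```

Because the crossover and mutation steps range over all feature positions, the search covers the full set of feature words even though the starting individuals contain only top-ranked ones.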
Multiple Chinese text classification experimental datasets are selected, and the data are preprocessed by word segmentation, stop-word removal, matrix transformation, and feature-word weighting. Different classification models, including multilayer perceptron neural networks, random forests, naive Bayes, K-nearest neighbors, and decision trees, are used to compare the optimal solutions of CHI_FIPGA with those of GA, CHI_GA, PSO, and CHI_PSO.
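The matrix-transformation and weighting step can be sketched as follows, assuming the documents are already word-segmented and stop-word-filtered; the toy corpus and the plain TF-IDF scheme here are illustrative, not the paper's exact setup.

```python
import math
from collections import Counter

# Toy corpus, already word-segmented and stop-word-filtered.
docs = [["股票", "市场", "上涨"],
        ["球队", "比赛", "上涨"],
        ["球队", "夺冠"]]

# Document frequency of each feature word.
df = Counter(w for d in docs for w in set(d))

def tfidf(doc):
    """TF-IDF weights for one document: term frequency times the log of
    the inverse document frequency over the corpus."""
    tf = Counter(doc)
    return {w: (tf[w] / len(doc)) * math.log(len(docs) / df[w]) for w in tf}

row = tfidf(docs[0])
# "股票" occurs in only one document, so it outweighs the common "上涨".
print(row["股票"] > row["上涨"])   # True
```

Stacking one such weight row per document produces the document-feature matrix on which the classification models are trained.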
The experimental results indicate that, compared with the random initial population GA, CHI_FIPGA improves classification accuracy by 14% on average and reduces the number of feature words by 66% on average. Compared with CHI_GA, it improves average classification accuracy by 9% and reduces the average number of feature words by 36%. Compared with PSO, it improves classification accuracy by 7% on average and reduces the number of feature words by 61% on average. Compared with CHI_PSO, it improves classification accuracy by 3.4% on average and reduces the number of feature words by 25%.
CHI_FIPGA evaluates and ranks feature words using the CHI method alone. Using multiple filtering methods such as IG, TF-IDF, and the F-value to evaluate and rank feature words comprehensively, analyzing the differing contributions of feature words to different categories, and taking the structural characteristics of the classification model into account during evaluation and ranking are feasible paths toward further improving the quality of the fixed initial population.

Key words: text classification, feature selection, genetic algorithm, chi-square test, initial population
