An Improved K-Modes Clustering Algorithm

doi:10.12005/orms.2019.0279

Abstract

Abstract: The traditional K-modes algorithm uses a simple 0-1 matching dissimilarity measure to compute the distance between two values of the same categorical attribute: the difference is zero when the two values are identical and one otherwise. Later variants replaced this with a frequency-based dissimilarity, but both approaches ignore the differences among attributes. In this paper, we study an attribute weighting algorithm based on rough sets and knowledge granulation. The algorithm both overcomes attribute redundancy and takes the differences among attributes into account. On this basis, we apply attribute weighting to the traditional K-modes algorithm so that the differences among attributes are no longer ignored. Experimental comparison with other K-modes clustering algorithms shows that the new algorithm is more effective.

Key words: clustering algorithm, categorical data, rough set, knowledge granulation, distance measure
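
The contrast between 0-1 matching and attribute-weighted matching can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract does not give the weighting formula, so the rule used here (weight proportional to 1 minus the knowledge granulation of the partition induced by each attribute) is a hypothetical stand-in, and the function names and toy data are assumptions made for illustration only.

```python
from collections import Counter


def simple_matching(x, y):
    """0-1 matching dissimilarity: number of attributes whose values differ."""
    return sum(a != b for a, b in zip(x, y))


def knowledge_granulation(column):
    """Knowledge granulation of the partition induced by one attribute:
    sum of squared equivalence-class sizes divided by |U|^2."""
    n = len(column)
    return sum(c * c for c in Counter(column).values()) / (n * n)


def granulation_weights(data):
    """Hypothetical weighting rule (an assumption, not taken from the paper):
    weight_j is proportional to 1 - GK(attribute j), normalized to sum to 1."""
    columns = list(zip(*data))
    raw = [1.0 - knowledge_granulation(col) for col in columns]
    total = sum(raw)
    if total == 0:
        # Every attribute is constant; fall back to uniform weights.
        return [1.0 / len(raw)] * len(raw)
    return [r / total for r in raw]


def weighted_matching(x, y, weights):
    """Weighted 0-1 matching: add the attribute weight wherever values differ."""
    return sum(w for a, b, w in zip(x, y, weights) if a != b)


# Toy categorical data; the third attribute is constant and carries no information.
data = [
    ("red", "small", "yes"),
    ("red", "large", "yes"),
    ("blue", "small", "yes"),
    ("blue", "large", "yes"),
]
weights = granulation_weights(data)
print(weights)
print(simple_matching(data[0], data[3]))
print(weighted_matching(data[0], data[3], weights))
```

In this toy data the constant third attribute receives weight zero under the assumed rule, which is one simple way a weighting scheme can discount a redundant attribute that plain 0-1 matching treats the same as any other.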

CLC number: TP18

SHI Zhen-quan, CHEN Shi-ping. An Improved K-Modes Clustering Algorithm[J]. Operations Research and Management Science, 2019, 28(12): 112-117.
