Density peaks clustering algorithm with K-nearest neighbors and weighted similarity

ZHAO Jia; CHEN Lei; WU Run-xiu; ZHANG Bo; HAN Long-zhe

Cite this article: ZHAO Jia, CHEN Lei, WU Run-xiu, ZHANG Bo, HAN Long-zhe. Density peaks clustering algorithm with K-nearest neighbors and weighted similarity[J]. Control Theory & Applications (控制理论与应用), 2022, 39(12): 2349-2357.

Density peaks clustering algorithm with K-nearest neighbors and weighted similarity

Received: 2021-08-28 Revised: 2022-12-15

DOI: 10.7641/CTA.2022.10810

2022,39(12):2349-2357

Keywords: density peaks clustering; local density; K-nearest neighbors; shared nearest neighbors; natural nearest neighbors

Funding: Supported by the National Natural Science Foundation of China (52069014, 61962036) and the Jiangxi Province Outstanding Youth Fund (2018ACB21029).

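Author	Affiliation	E-mail
ZHAO Jia*	School of Information Engineering, Nanchang Institute of Technology	zhaojia925@163.com
CHEN Lei	School of Information Engineering, Nanchang Institute of Technology
WU Run-xiu	School of Information Engineering, Nanchang Institute of Technology
ZHANG Bo	Global Energy Interconnection Research Institute Co., Ltd.
HAN Long-zhe	School of Information Engineering, Nanchang Institute of Technology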

Abstract

The local density definition of the density peaks clustering (DPC) algorithm ignores the density differences between clusters in data with uneven density distribution, so cluster centers are easily selected incorrectly. Its assignment strategy assigns samples in a chain through the density peaks according to Euclidean distance. In manifold data, however, many samples usually lie far from their density peak, so a large number of samples that should belong to the same cluster are wrongly assigned to other clusters, which lowers clustering accuracy. To address this, this paper proposes a density peaks clustering algorithm with K-nearest neighbors and weighted similarity. The algorithm redefines the local density of a sample based on its K-nearest-neighbor information; this definition can adjust the magnitude of the local density and allows the density peaks to be found accurately. It defines the similarity between samples using their shared nearest neighbors and natural nearest neighbors, which removes the influence of Euclidean distance on the assignment strategy and avoids the cascading errors produced during sample assignment. Comparative experiments on manifold datasets and on datasets with uneven density distribution show that the proposed algorithm accurately finds the density peaks of datasets whose clusters differ greatly in density, avoids cascading assignment errors on manifold data, and achieves satisfactory clustering results. Its clustering results on real-world datasets are also excellent.
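The abstract does not give the paper's formulas, so the following is only a rough sketch of the ingredients it names, not the authors' method. It assumes an illustrative K-nearest-neighbor density of the form exp(-mean distance to the K nearest neighbors), uses the relative distance of the original density peaks clustering algorithm, and counts shared nearest neighbors as a simple similarity signal. The natural-nearest-neighbor weighting is omitted. All function names and the toy data are hypothetical.

```python
import math

def pairwise_distances(points):
    """Euclidean distance matrix as a list of lists."""
    return [[math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q))) for q in points]
            for p in points]

def knn_indices(dist, k):
    """Indices of the k nearest neighbors of each sample (the sample itself excluded)."""
    n = len(dist)
    return [sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])[:k]
            for i in range(n)]

def knn_density(dist, knn):
    """Assumed density: exp(-mean distance to the k nearest neighbors).
    This is an illustrative choice, NOT the definition used in the paper."""
    return [math.exp(-sum(dist[i][j] for j in nbrs) / len(nbrs))
            for i, nbrs in enumerate(knn)]

def delta_distance(dist, rho):
    """Distance to the nearest sample of strictly higher density.
    A sample with no denser sample gets its largest distance to any sample."""
    n = len(rho)
    delta = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        delta.append(min(higher) if higher else max(dist[i]))
    return delta

def snn_count(knn, i, j):
    """Number of neighbors that samples i and j have in common."""
    return len(set(knn[i]) & set(knn[j]))

# Toy data: a tight group (indices 0-4) and a loose group (indices 5-9).
points = [(0, 0), (0.1, 0), (0, 0.1), (0.1, 0.1), (0.05, 0.05),
          (10, 10), (11, 10), (10, 11), (11, 11), (10.5, 10.5)]

dist = pairwise_distances(points)
knn = knn_indices(dist, k=3)
rho = knn_density(dist, knn)
delta = delta_distance(dist, rho)

# Samples with both high density and high relative distance are center candidates.
gamma = [r * d for r, d in zip(rho, delta)]
centers = sorted(sorted(range(len(points)), key=lambda i: gamma[i], reverse=True)[:2])
print(centers)
```

On this toy data the two selected candidates are the middle samples of the two groups (indices 4 and 9), while samples from different groups share no nearest neighbors. A neighbor-based similarity of this kind does not depend on how far a sample lies from its density peak, which is the property the abstract relies on for manifold data.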