An R Package for Cluster Validation: clValid

5188 reads | 2017-7-10 21:53 | Personal category: Bioinformatics | System category: Teaching notes | Keywords: scholar

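Clustering is an unsupervised technique that groups objects lying close to one another in a multidimensional feature space, usually to reveal some inherent structure in the data. It is a common method in the analysis of high-throughput genomic data, where the goal is to group genes or proteins that have similar expression patterns and may share common biological pathways.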

A large number of clustering algorithms exist, and many of them have shown promise for analyzing genomic data.

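To validate the results of a cluster analysis, and to determine which clustering algorithm performs best for a particular experiment, a variety of measures have been proposed. Validation can be based solely on internal properties of the data or on an external reference, and it can use the expression data alone or in combination with relevant biological information.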

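clValid provides functions for validating the results of a cluster analysis. It offers three types of cluster validation measures: "internal", "stability", and "biological".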

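1) "internal" measures take only the dataset and the clustering partition as input, and use intrinsic information in the data to assess the quality of the clustering.

2) "stability" measures are a special kind of internal measure. They assess the consistency of a clustering result by comparing it with the clusters obtained after removing each column, one at a time.

3) "biological" measures assess the ability of a clustering algorithm to produce biologically meaningful clusters.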
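The column-removal idea behind the stability measures can be sketched in a few lines. The example below is plain Python written for illustration, not clValid's R code. It computes the average proportion of observations whose cluster membership changes when one column is dropped, in the spirit of a non-overlap measure. The names `apn` and `split_by_mean` are hypothetical, and `split_by_mean` is only a toy clusterer standing in for a real algorithm.

```python
def split_by_mean(rows):
    """Toy clusterer (an assumption for this sketch): label a row 1 if its
    mean is above the average of all row means, otherwise 0."""
    means = [sum(r) / len(r) for r in rows]
    cutoff = sum(means) / len(means)
    return [int(m > cutoff) for m in means]


def apn(rows, cluster_fn):
    """Average proportion of non-overlap between the clustering of the full
    data and the clusterings obtained with one column removed at a time."""
    full = cluster_fn(rows)
    n_cols = len(rows[0])
    total = 0.0
    for col in range(n_cols):
        # Re-cluster the data with column `col` removed.
        reduced = [row[:col] + row[col + 1:] for row in rows]
        part = cluster_fn(reduced)
        for i in range(len(rows)):
            full_members = {j for j, lab in enumerate(full) if lab == full[i]}
            part_members = {j for j, lab in enumerate(part) if lab == part[i]}
            # Fraction of i's original cluster mates that i lost.
            total += 1.0 - len(full_members & part_members) / len(full_members)
    return total / (len(rows) * n_cols)


stable = [[1, 1, 1], [1, 2, 1], [9, 9, 9], [9, 8, 9]]
unstable = [[0, 0, 10], [0, 0, 0], [10, 10, 0], [10, 10, 10]]

print(apn(stable, split_by_mean))    # partition survives every column removal
print(apn(unstable, split_by_mean))  # partition changes for some removals
```

Values close to 0 indicate a consistent clustering. Comparing sets of cluster members rather than raw labels keeps the result unaffected when a re-run numbers the same clusters differently.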

For internal validation, the selected measures reflect the compactness, connectedness, and separation of the cluster partitions.

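1) Connectedness relates to the extent to which observations are placed in the same cluster as their nearest neighbors in the data space.

2) Compactness assesses the homogeneity of clusters, usually through the within-cluster variance, while separation quantifies how well separated the clusters are (usually by measuring the distance between cluster centroids).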

Because compactness and separation show opposing trends, popular methods combine the two into a single score.
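To make these internal measures concrete, here is a minimal sketch in plain Python (not clValid's R code). It computes a connectivity score, which adds a penalty of 1/rank whenever one of an observation's nearest neighbors falls in a different cluster, and the Dunn index, one well-known way to combine compactness and separation as the smallest between-cluster distance divided by the largest within-cluster distance. The function names and the `n_neighbors` parameter are assumptions of this sketch.

```python
import math


def connectivity(points, labels, n_neighbors=2):
    """Sum of 1/rank penalties for nearest neighbors in another cluster.
    Lower is better; 0 means every point sits with its nearest neighbors."""
    total = 0.0
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: math.dist(p, points[j]))
        for rank, j in enumerate(others[:n_neighbors], start=1):
            if labels[j] != labels[i]:
                total += 1.0 / rank
    return total


def dunn_index(points, labels):
    """Smallest distance between points in different clusters divided by the
    largest distance between points in the same cluster. Higher is better.
    Assumes at least two clusters and one cluster with two distinct points."""
    min_between = math.inf
    max_diameter = 0.0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d = math.dist(points[i], points[j])
            if labels[i] == labels[j]:
                max_diameter = max(max_diameter, d)
            else:
                min_between = min(min_between, d)
    return min_between / max_diameter


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
good = [0, 0, 0, 1, 1, 1]   # matches the two visible groups
bad = [0, 1, 0, 1, 0, 1]    # mixes the groups

print(connectivity(points, good), dunn_index(points, good))
print(connectivity(points, bad), dunn_index(points, bad))
```

The partition that matches the two groups gets a connectivity of 0 and a much larger Dunn index than the mixed partition.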

clValid also makes nine clustering algorithms available, including hierarchical, K-means, self-organizing maps (SOM), and model-based clustering.

UPGMA (Unweighted Pair Group Method with Arithmetic Mean). It is an agglomerative hierarchical clustering algorithm that yields a dendrogram, which can be cut at a chosen height to produce the desired number of clusters.
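As a rough illustration of average-linkage agglomeration, the sketch below (plain Python, not the R implementation) repeatedly merges the two clusters with the smallest mean pairwise distance. Instead of building a dendrogram and cutting it at a height, it simply stops when the requested number of clusters remains. The function name `average_linkage` is an assumption of this sketch.

```python
import math


def average_linkage(points, n_clusters):
    """Naive agglomerative clustering with average linkage.
    Returns a list of clusters, each a list of point indices."""
    clusters = [[i] for i in range(len(points))]

    def avg_dist(a, b):
        # Mean distance over all pairs with one point in each cluster.
        return sum(math.dist(points[i], points[j])
                   for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge them.
        _, a, b = min((avg_dist(clusters[a], clusters[b]), a, b)
                      for a in range(len(clusters))
                      for b in range(a + 1, len(clusters)))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (20.0, 20.0)]
print(average_linkage(points, 3))
print(average_linkage(points, 2))
```

This naive version recomputes every pairwise average at each step, so it is only suitable for small examples.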

K-means. It is an iterative method that minimizes the within-class sum of squares for a given number of clusters. Often another clustering algorithm (e.g., UPGMA) is run first to determine starting points for the cluster centers.
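The iteration can be sketched as follows, assuming the common assign-then-update scheme (Lloyd's algorithm) and starting centers supplied by the caller. This is plain Python for illustration, not the R implementation, and the names `kmeans` and `within_ss` are hypothetical.

```python
import math


def kmeans(points, centers, n_iter=100):
    """Alternate between assigning points to the nearest center and moving
    each center to the mean of its points, until the centers stop changing.
    Expects n_iter >= 1."""
    centers = [tuple(c) for c in centers]
    for _ in range(n_iter):
        labels = [min(range(len(centers)),
                      key=lambda k: math.dist(p, centers[k]))
                  for p in points]
        new_centers = []
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append(
                    tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[k])  # leave an empty cluster's center in place
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


def within_ss(points, labels, centers):
    """Within-class sum of squares, the quantity K-means tries to minimize."""
    return sum(math.dist(p, centers[lab]) ** 2
               for p, lab in zip(points, labels))


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
labels, centers = kmeans(points, [(0.0, 0.0), (10.0, 0.0)])
print(labels, centers, within_ss(points, labels, centers))
```

Because the result depends on the starting centers, taking them from another algorithm's output, as mentioned above, is one way to get a reasonable start.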

Diana. It is a divisive hierarchical algorithm that initially starts with all observations in a single cluster, and successively divides the clusters until each cluster contains a single observation.

PAM. Partitioning around medoids (PAM) is similar to K-means, but is considered more robust because it admits the use of other dissimilarities besides Euclidean distance. Like K-means, the number of clusters is fixed in advance, and an initial set of cluster centers is required to start the algorithm.
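The sketch below illustrates the medoid idea on a precomputed dissimilarity matrix, which is what lets the method work with distances other than Euclidean. It is plain Python, covers only a greedy swap step starting from medoids supplied by the caller, and is not the R implementation. The names `total_cost` and `pam_swap` are hypothetical.

```python
def total_cost(dist, medoids):
    """Sum over all observations of the dissimilarity to the nearest medoid."""
    return sum(min(dist[i][m] for m in medoids) for i in range(len(dist)))


def pam_swap(dist, medoids):
    """Repeatedly apply the single medoid/non-medoid swap that lowers the
    total cost the most, until no swap improves it."""
    medoids = list(medoids)
    best = total_cost(dist, medoids)
    improved = True
    while improved:
        improved = False
        best_swap = None
        for pos in range(len(medoids)):
            for cand in range(len(dist)):
                if cand in medoids:
                    continue
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                cost = total_cost(dist, trial)
                if cost < best:
                    best, best_swap = cost, trial
        if best_swap is not None:
            medoids, improved = best_swap, True
    return sorted(medoids), best


# Absolute-difference (Manhattan) dissimilarities between six 1-D values.
values = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in values] for a in values]

print(pam_swap(dist, [0, 1]))  # starts with both medoids in the left group
```

Unlike a K-means center, a medoid is always one of the observations, so only the dissimilarity matrix is needed and no means are ever computed.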

Clara. It is a sampling-based algorithm that runs PAM on a number of sub-datasets. This allows for faster running times when the number of observations is relatively large.

Fanny. This algorithm performs fuzzy clustering, where each observation can have partial membership in each cluster.

SOM. Self-organizing maps are an unsupervised learning technique based on neural networks, highly regarded for their ability to map and visualize high-dimensional data in two dimensions.

Model-based clustering. Under this approach, a statistical model consisting of a finite mixture of Gaussian distributions is fit to the data.
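To show what a finite Gaussian mixture means for cluster assignment, the sketch below (plain Python, one-dimensional, with the mixture parameters assumed to be already known rather than fitted) computes the probability that an observation belongs to each component. The function names and the example parameters are assumptions of this sketch.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def responsibilities(x, weights, means, sds):
    """Posterior probability of each mixture component given observation x."""
    dens = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
    total = sum(dens)
    return [d / total for d in dens]


# Two equally weighted components centered at -1 and 1 (illustrative values).
weights, means, sds = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]

print(responsibilities(0.0, weights, means, sds))  # halfway between the components
print(responsibilities(2.0, weights, means, sds))  # closer to the second component
```

Fitting the model means estimating the weights, means, and standard deviations from the data; each observation can then be assigned to the component with the highest probability.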

SOTA. Self-organizing tree algorithm (SOTA) is an unsupervised network with a divisive hierarchical binary tree structure.

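To repost this article, please contact the original author for permission, and state that it comes from 冀颜's ScienceNet (科学网) blog.
Link: https://m.sciencenet.cn/blog-3291578-1065612.html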