warning message: Columns of X are linearly dependent to within machine precision. Using only the first # components to compute TSQUARED. (# is a number)

Translation: The columns of X are linearly dependent to within machine precision; only the first # columns are used to compute TSQUARED.

Explanation of the warning (from: http://stackoverflow.com/questions/27997736/matlab-warns-columns-of-x-are-linearly-dependent-to-within-machine-precision ):

Question:
When I used the function princomp in Matlab to reduce the dimensions of features, it warned: "Columns of X are linearly dependent to within machine precision. Using only the first 320 components to compute TSQUARED." What does it mean? The original dimension of the features is 324. I would be very grateful if somebody could answer my question.

Explanation:
(The warning does not affect the computed results; it is only a reminder that linear dependencies exist in the data.)

For a more graphic interpretation of this warning, imagine your data being 3-dimensional instead of 324-dimensional. These would be points in space. The output of princomp should be the principal axes of an ellipsoid that aligns well with your data. The equivalent warning "Using only the first 2 components" would mean: your data points lie on a plane (up to numerical error), so your ellipsoid really is a flat ellipse. As PCA is usually used for dimensionality reduction, this is not really worrying. It just means that your original data is not 324-dimensional, but really only 320-dimensional, yet resides in R^324. You would get the same warning using random data of this shape (the exact values here are illustrative: spread widest along x, less along y, and constant along z):

    N = 100;
    X = [randn(N,1)*100, randn(N,1)*50, zeros(N,1)];  % third column is constant
    X_centered = bsxfun(@minus, X, mean(X));          % center each column
    [coeff, score, latent] = princomp(X_centered);    % triggers the warning
    plot3(X_centered(:,1), X_centered(:,2), X_centered(:,3), '.');

coeff(:,1) will be approximately [1;0;0] and latent(1) the biggest value, as the data is spread most along the x-axis. The second vector coeff(:,2) will be approximately [0;1;0], while latent(2) will be quite a bit smaller than latent(1), as the second most important direction is the y-axis, which is not as spread out as the first direction. The rest of the vectors will be orthonormal to the ones already found. (In our simple case the only possibility is [0;0;1], and latent(3) will be zero, as the data is flat.)
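You can also check the effective dimensionality directly: MATLAB's rank function counts the linearly independent columns to within machine precision. A minimal check on the example above (for the questioner's 324-column data it would return 320):

    % rank() reports 2 for the 3-column example above, matching the
    % "first 2 components" mentioned in the warning.
    r = rank(X_centered)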
Principal Coordinates Analysis

Principal Coordinates Analysis (PCoA, = multidimensional scaling, MDS) is a method to explore and to visualize similarities or dissimilarities of data. It starts with a similarity matrix or dissimilarity matrix (= distance matrix) and assigns each item a location in a low-dimensional space, e.g. as a 3D graphic.

Rationale
PCoA tries to find the main axes through a matrix. It is a kind of eigenanalysis (sometimes referred to as singular value decomposition) and calculates a series of eigenvalues and eigenvectors. Each eigenvalue has an eigenvector, and there are as many eigenvectors and eigenvalues as there are rows in the initial matrix. Eigenvalues are usually ranked from the greatest to the least. The first eigenvalue is often called the dominant or leading eigenvalue. Using the eigenvectors we can visualize the main axes through the initial distance matrix. Eigenvalues are also often called latent values. The result is a rotation of the data matrix: it does not change the positions of points relative to each other; it just changes the coordinate system!

Interpretation
By using PCoA we can visualize individual and/or group differences. Individual differences can be used to show outliers.

Note: There is also a method called 'Principal Component Analysis' (PCA, sometimes also misleadingly abbreviated as 'PCoA'), which is different from PCoA. PCA is used for similarities and PCoA for dissimilarities. However, all binary measures (Jaccard, Dice etc.) are distance measures, and therefore PCoA should be used. For details see the following box:

Why does ClusterVis perform Principal Coordinates Analysis (PCoA, = 'Classical Multidimensional Scaling') instead of Principal Components Analysis (PCA)? Let's look at the differences between PCA and PCoA:

Principal Components Analysis (PCA)
- transforms a number of possibly correlated variables (a similarity matrix!) into a smaller number of uncorrelated variables called principal components. So it reduces the dimensions of a complex data set and can be used to visualize complex data. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible.
- captures as much of the variation in the data as possible
- principal components are ...
  * summary variables
  * linear combinations of the original variables
  * uncorrelated with each other
  * capture as much of the original variance as possible

Classical Multidimensional Scaling (CMDS)
- is similar in spirit to PCA, but it takes a dissimilarity as input! A dissimilarity matrix shows the distance between every possible pair of objects.
- is a set of data analysis techniques that display the structure of (complex) distance-like data (a dissimilarity matrix!) in a lower-dimensional space without too much loss of information.
- The goal of MDS is to faithfully represent these distances in the lowest possible dimensional space.

ClusterVis calculates a principal coordinate analysis (PCoA) of a distance matrix (see Gower, 1966) and calculates a centered matrix. The centered matrix is then decomposed into its component eigenvalues and eigenvectors. The eigenvectors, standardized by dividing by the square root of their corresponding eigenvalue, are output as the principal coordinate axes. This analysis is also called metric multidimensional scaling. It is useful for ordination of multivariate data on the basis of any distance function.
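The procedure just described (double-center the distance matrix, eigendecompose, scale the eigenvectors) can be sketched in a few lines of MATLAB. This is a minimal illustration following Gower (1966), not ClusterVis's actual code; the toy points P and the choice of k = 2 axes are assumptions for the example:

    % Toy data: three points whose pairwise Euclidean distances form the
    % dissimilarity (distance) matrix D.
    P = [0 0; 3 0; 0 4];
    D = squareform(pdist(P));

    n = size(D, 1);
    A = -0.5 * D.^2;                 % transform the squared distances
    J = eye(n) - ones(n)/n;          % centering matrix
    G = J * A * J;                   % double-centered (Gower) matrix

    [V, L] = eig(G);                 % eigenvectors and eigenvalues
    [lambda, order] = sort(diag(L), 'descend');
    V = V(:, order);

    k = 2;                           % number of principal coordinate axes
    % MATLAB's eig returns unit-length eigenvectors, so each axis is
    % standardized by the square root of its eigenvalue to give coordinates.
    coords = V(:, 1:k) * diag(sqrt(lambda(1:k)));

The pairwise distances between the rows of coords reproduce D, which is the sense in which PCoA is a rotation: relative positions are preserved, only the coordinate system changes.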
References:
Zuur, A.F., Ieno, E.N. & Smith, G.M. (2007): Analysing Ecological Data. Statistics for Biology and Health. Springer, New York. ISBN 978-0-387-45967-7 (Print), 978-0-387-45972-1 (Online).
Gower, J.C. (1966): Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika 53: 325-338.

From: http://www.sequentix.de/gelquest/help/principal_coordinates_analysis.htm

Also see:
http://en.wikipedia.org/wiki/Principal_components_analysis
http://en.wikipedia.org/wiki/Principal_coordinates_analysis
http://en.wikipedia.org/wiki/Factor_analysis
http://forrest.psych.unc.edu/teaching/p208a/mds/mds.html
http://www.mathpsyc.uni-bonn.de/doc/delbeke/delbeke.htm
In a UniFrac PCoA analysis, "unweighted" means that differences between environments are measured by the branch length unique to each environment.

The UniFrac Metric

Calculating the UniFrac metric. The majority of options in the UniFrac interface make comparisons based on the UniFrac metric. The UniFrac metric measures the difference between two environments in terms of the branch length that is unique to one environment or the other. In the tree on the left (below), the division between the two environments (labeled red and blue) occurs very early in the tree, so that all of the branch length is unique to one environment or the other. This gives the maximum possible UniFrac distance, 1.0. In the tree on the right, every sequence in the first environment has a very similar counterpart in the other environment, so most of the branch length in the tree comes from nodes that have descendants in both environments. In the example, there is about as much branch length unique to each environment (red or blue) as shared between environments (purple), so the UniFrac value would be about 0.5. If the two environments were identical and all the same sequences were found in both, all the branch length would be shared and the UniFrac value would be 0.

The weighted UniFrac metric was developed to account for similar or identical sequences by increasing the weight of branches that lead to many sequences.

Calculating the Weighted UniFrac Metric. The UniFrac metric described above does not account for the relative abundance of sequences in the different environments, because duplicate sequences contribute no additional branch length to the tree (by definition, the branch length that separates a pair of duplicate sequences is zero, because no substitutions separate them). Because the relative abundance of different kinds of bacteria can be critical for describing community changes, we have developed a variant of the algorithm, weighted UniFrac, which weights the branches based on abundance information during the calculations. Weighted UniFrac can thus detect changes in how many organisms from each lineage are present, as well as changes in which organisms are present. The figure below illustrates how the weighted algorithm works. Branch lengths are weighted by the relative abundance of sequences in the square and circle communities; square sequences are weighted twice as much as circle sequences because there are twice as many total circle sequences in the dataset. The width of each branch is proportional to the degree to which it is weighted in the calculations, and the grey branches have no weight. Branches 1 and 2 have heavy weights, since their descendants are biased towards squares and circles respectively. Branch 3 contributes no value, since it has an equal contribution from circle and square sequences after normalization for the different sample sizes.
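Both metrics can be sketched concretely once the tree is reduced to per-branch quantities. In this minimal MATLAB illustration, b(i) is the length of branch i and cntA(i)/cntB(i) count the sequences from communities A and B that descend from it; all names and toy values are assumptions for the example, not part of the UniFrac software:

    % Toy tree with four branches; community A has AT = 3 sequences,
    % community B has BT = 4.
    b    = [0.5 0.3 0.2 0.6];          % branch lengths
    cntA = [3 2 1 0];                  % A-sequences below each branch
    cntB = [1 0 1 3];                  % B-sequences below each branch
    AT = 3;  BT = 4;

    % Unweighted UniFrac: the fraction of total branch length that is
    % unique to one community (descendants in exactly one of A and B).
    isUnique = (cntA > 0) ~= (cntB > 0);
    unweighted = sum(b(isUnique)) / sum(b);

    % Weighted UniFrac (raw value u): each branch contributes its length
    % times the difference in relative abundance of its descendants.
    u = sum(b .* abs(cntA./AT - cntB./BT));

Duplicate sequences raise cntA or cntB without adding branch length, which is exactly why the unweighted value ignores abundance while u responds to it.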
The "normalized weighted" option divides the weighted UniFrac distance value by a distance scale factor D, so that taxa evolving at different rates are treated on an equal footing when the UniFrac distance is computed. Fast-evolving lineages have long branches; normalization compares branches of different lengths in proportion to the average distance of their sequences from the root, i.e. it removes the effect of scale.

Normalizing Weighted UniFrac Values. If the phylogenetic tree is not ultrametric (i.e. if different sequences in the sample have evolved at different rates), comparing environments with the Cluster Environments or PCA analysis options using weighted UniFrac will place more emphasis on communities that contain taxa that have evolved more quickly. This is because these taxa contribute more branch length to the tree. In some situations, it may be desirable to normalize the branch lengths within each sample.

This normalization has the effect of treating each sample equally instead of treating each unit of branch length equally: the issues involved are similar to those involved in performing multivariate analyses using the correlation matrix, to treat each variable equally independent of scale, or using the covariance matrix, to take the scale into account. Normalization has the additional effect of placing all pairwise comparisons on the same scale as unweighted UniFrac (0 for identical communities, 1 for non-overlapping communities), allowing comparisons among different analyses with different samples. The scale of the raw weighted UniFrac value (u) depends on the average distance of each sequence from the root. The normalization to correct for this effect is performed by dividing u by the distance scale factor D (see the sketch below), which is the average distance of each sequence from the root weighted by the number of times each sequence was observed in each community.
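Continuing the toy example above, the scale factor D can be written directly from this verbal definition; d(j), nA(j) and nB(j) are assumed names for the root-to-tip distance and per-community counts of sequence j:

    % Same toy tree as above: recompute the raw weighted UniFrac value u.
    b    = [0.5 0.3 0.2 0.6];
    cntA = [3 2 1 0];  cntB = [1 0 1 3];
    AT = 3;  BT = 4;
    u = sum(b .* abs(cntA./AT - cntB./BT));

    % Per-sequence values: d(j) is the distance of sequence j from the
    % root; nA(j), nB(j) are how often it was observed in A and in B.
    d  = [0.8 0.7 0.6];
    nA = [2 1 0];
    nB = [0 1 3];

    % D: average distance of each sequence from the root, weighted by the
    % number of times the sequence was observed in each community.
    D = sum(d .* (nA./AT + nB./BT));

    % The normalized value lies on the same 0..1 scale as unweighted
    % UniFrac (0 = identical communities, 1 = non-overlapping).
    normalizedWeighted = u / D;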
This is a translated article; does anyone need it?

A Tutorial on Principal Components Analysis
Lindsay I Smith, February 26, 2002

Chapter 1: Introduction

This tutorial gives the reader the basic principles of principal component analysis (PCA). PCA is a statistical technique applied in areas such as human face recognition and image compression, and it is a general technique for extracting patterns from high-dimensional data.

Before describing PCA, the tutorial first introduces some related mathematical concepts, including standard deviation, covariance, eigenvectors and eigenvalues. This background is directly relevant to the PCA chapter; readers already familiar with it can skip ahead. Examples throughout the tutorial illustrate the concepts discussed. For further information, see "Elementary Linear Algebra" (5th edition) by Howard Anton, published by John Wiley & Sons Inc.

Formulas are not supported here, so what can I do? I will just refer to my original article: http://jacobyuan.blog.sohu.com/148317731.html
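Since the blog cannot display the tutorial's formulas, the two background concepts that matter most for PCA (the covariance matrix and its eigenvectors/eigenvalues) can at least be shown as a short MATLAB sketch; the data values here are made up for illustration and are not from the tutorial:

    % Toy 2-D data with correlated columns.
    X = [1 2; 2 3; 3 5; 4 6; 5 8];
    Xc = bsxfun(@minus, X, mean(X));   % subtract the mean of each column

    C = cov(Xc);                       % 2x2 covariance matrix
    [V, L] = eig(C);                   % columns of V: eigenvectors;
                                       % diagonal of L: eigenvalues

    % The eigenvector with the largest eigenvalue is the first principal
    % component: the direction of greatest variance in the data.
    [~, idx] = max(diag(L));
    pc1 = V(:, idx);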