Here we store material for the Data Analysis part of the CB2030 Systems Biology course.
This project is maintained by statisticalbiotechnology.
Nice catch. I am just misspeaking. As a corollary, you can see in the code that I am taking values from Vt.
No, not really. However, the principal components will show that the variance of some variables is better described as a linear combination of the variables. That would be a sign of co-variation of the variables.
All data analysis aims at explaining the variation in the data. PCA is a nice way to do so for covarying data. The point is that you factor out the covariation so that you can study it specifically.
PCA focuses more directly on explaining the underlying phenomena driving the variation in the data. If there are one or a few factors with a linear influence on many of the observed variables, then PCA is a great method for finding such factors. It is a very good idea to perform both types of analysis side by side to see if you get corroborating results.
No, PCA is not a clustering technique. The analysis itself does not give you any groups of data points. PCA has multiple applications. It can be used for visualization, for dimensionality reduction, and as a technique to give mechanistic explanations of any linear effects that you are trying to assess. It can also be used for determining biases in your patient material, and for missing value imputation.

A very nice and applied explanation: One example is in genomics, where, when we set up an experiment, we have biological and technical replicates. Usually what we are looking for is a biological difference between two conditions, and to be sure that this is the case, PCA can be used as a kind of quality control. What we want to observe is that most of the variance is explained by biology and not by technical factors, i.e. that the biological difference is captured by principal component 1, which explains more variance than the later components. I hope that this was a clear and correct explanation. If not, looking at the PCA of this article might help: here the authors RNA-sequenced Zika-virus infected cells for differential gene expression using two different sequencing platforms (MiSeq and NextSeq). When they did the PCA, they could show that 50 % of the variance between the samples was due to whether the cells were infected with Zika virus or not, and 20 % of the variance was due to the platform they used (MiSeq or NextSeq).
They are directly related to the amount of variance explained by each principal component. As explained here, you can use them directly to calculate the proportion of the variance explained by each PC.
Variances of uncorrelated variables are always additive, and so are the variances explained by the principal components.
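To illustrate the two answers above, here is a minimal sketch, on made-up random data, of how the singular values from an SVD translate into proportions of explained variance, and how those proportions add up across components:

import numpy as np

# Hypothetical data matrix: genes as rows, samples as columns
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
Xc = X - X.mean(axis=1, keepdims=True)  # center each gene

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# The squared singular values are proportional to the variance
# explained by each principal component
explained = S**2 / np.sum(S**2)
print(explained[:3])             # proportion of variance per PC
print(np.cumsum(explained)[:3])  # the proportions are additive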
No, not if all dimensions are used. Otherwise, yes, obviously.
Yes, subtle differences might be lost in the analysis.
Yes, SVD is a method for computing PCA. A more elaborate answer can be found here.
Normally one uses SVD, as it is more numerically stable than other methods for computing PCA.
No, all principal components are orthogonal to each other. An N-dimensional space can contain at most N vectors that are mutually orthogonal.
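A quick numerical check of this orthogonality on made-up data (with the principal directions coming out as the rows of Vt from an SVD):

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))
Xc = X - X.mean(axis=0)  # center each column

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
# The principal directions (rows of Vt) are mutually orthogonal
print(np.allclose(Vt @ Vt.T, np.eye(Vt.shape[0])))  # True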
You would normally assume that a tall person is also a heavy person. This implies that there is a linear relationship between squared length and weight. That relationship will explain most of the variation in your data and end up in PC1. PC1 would normally not be interesting to study, at least not in e.g. T2D studies. However, hopefully, PC2 will capture the BMI, which might be of larger interest.
Good question. After the transformation, the PCs will have an eigenvalue attached to them, so that you know how much of the variance they explain. In particular, PC1 will be the linear combination of the data that optimally explains the variation in the data.
These are linear combinations of features, hence there is no single feature to refer to.
In general, features that covary between the PCs can be a sign of a bias among the patients, which we might be able to detect by studying the eigengenes.
An eigengene is the linear combination of patients that would explain most of the variation in the data.
No, this is an arbitrary choice. You can flip the signs of the elements in the eigenpatients and the eigengenes and they would still explain the same thing.
I like the concepts of Kluwer et al. Eigengenes are the best representation of how genes behave, and eigenassays (or eigenexperiments, eigenpatients) are representations of how patients behave.
There are some subtle differences between PCA and SVD, so the PCs and the eigenvectors differ by a scalar factor. Modulo those differences: if the data in the introduction to PCA was gene expression values for two different patients, the components in the figure for cell 6 would be eigenpatients one and two, and the directions of the axes of the input space would be the eigengenes. The eigengene equivalent is also available through
pca.components_
in cell 4.
The first eigenpatient is what a gene expression profile from a single patient explaining most of the variation in the data would look like.
The PCs tell you which linear combinations of variables explain most of the variation among the variables.
That part of the answer is not essential for the understanding of PCA. However: a covariance matrix is a description of how much variation you have in the data in each dimension (on the diagonal) as well as how much the dimensions of the data points covary with each other (the off-diagonal elements). The answer makes the point that PCA is a projection from the input space to a coordinate system where there is no covariation in the data, i.e. where the covariance matrix is diagonal.
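A small sketch of that point, on made-up covarying data: after projection onto the principal components, the empirical covariance matrix becomes (numerically) diagonal.

import numpy as np

rng = np.random.default_rng(2)
# Two strongly covarying variables, 500 observations as rows
x = rng.normal(size=500)
X = np.column_stack([x, 2 * x + rng.normal(scale=0.5, size=500)])
Xc = X - X.mean(axis=0)

print(np.cov(Xc.T))  # clear off-diagonal elements: the variables covary

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt.T        # project onto the principal components

print(np.cov(Z.T))   # off-diagonal elements ~0: the covariation is factored out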
Neither of the techniques is a clustering technique. PCA is a simpler technique than t-SNE, and gives mechanistic insight into the data. It is worth noting as well that many consider t-SNE to be overused.
I guess you mean factor analysis. PCA is a type of factor analysis.
The upper bound on the number of PCs is the rank of the expression matrix, which often equals the minimum of the number of columns and rows of the analyzed matrix.
It is a matter of a) how much variance remains to be explained and b) your taste.
Both the eigengenes and eigenassays of Kluwer et al. are unit vectors. There is no significance associated with them, per se.
The question directly relates to how much of the variance you want to cover.
import numpy as np
from numpy.linalg import svd

X = combined.values  # the combined expression matrix from the notebook, genes as rows
# Build a matrix of the same shape as X, where each row holds that gene's mean
Xm = np.tile(np.mean(X, axis=1)[np.newaxis].T, (1, X.shape[1]))
# SVD of the centered data
U, S, Vt = svd(X - Xm, full_matrices=True, compute_uv=True)
What is actually happening here? I do understand that the generated output is the U matrix, which contains the eigensamples; the S matrix, which is the singular value matrix; and Vt, which contains the eigengenes. However, I do not follow what is happening beforehand and what Xm is used for.
Great question. We often centralize (i.e. remove the mean of each probe) our data before SVD. In the code, `np.mean(X, axis=1)[np.newaxis].T` will render a one-dimensional matrix with the mean expression value for each gene. Subsequently, you `tile`, i.e. copy, that one-dimensional vector into `X.shape[1]` copies, forming a matrix of the same dimensions as `X`.
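As a side note, a more idiomatic way to do the same centering, assuming the same matrix X as in the snippet above, is NumPy broadcasting, which avoids building the tiled matrix explicitly:

import numpy as np
from numpy.linalg import svd

X = combined.values                      # as in the snippet above
Xc = X - X.mean(axis=1, keepdims=True)   # subtract each gene's mean via broadcasting
U, S, Vt = svd(Xc, full_matrices=True, compute_uv=True)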
Not in the standard formulation of PCA. However, there are kernelized versions of PCA; see the sketch below.
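A minimal sketch of such a kernelized PCA using scikit-learn's KernelPCA on made-up data; the RBF kernel and the gamma value here are arbitrary choices, not recommendations:

import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 10))  # 100 samples, 10 features

# Non-linear PCA through an RBF kernel
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=0.1)
Z = kpca.fit_transform(X)
print(Z.shape)  # (100, 2)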
I am guessing that you are referring to Kluwer et al. This is hard to explain if you did not take algebra and geometry in your curriculum. r is the rank of the matrix, which in some cases can be lower than the lowest dimension n.
r < n only when a column or row in the matrix X is a linear combination of the others.
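A small numerical illustration on a made-up matrix:

import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(4, 5))
print(np.linalg.matrix_rank(A))  # 4, i.e. min(4, 5): full rank

B = A.copy()
B[3] = 2 * B[0] + 3 * B[1]       # make one row a linear combination of others
print(np.linalg.matrix_rank(B))  # 3 < min(4, 5)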
The genes with the highest, respectively lowest, values in eigenpatients 1 and 2.
Your interpretation that KRT17 drives both eigenpatients 1 and 2, and hence is a better explanation of the data than the other genes, could be right. KRT17 is a known cancer-related gene. At this point of the analysis, it is not fully clear what the biological interpretation of PC2 is. From the eigengene plot, it is however clear that PC1 seems to capture the difference between LUAD and LUSC.
It would be helpful if you pointed out at what part of the paper you stopped understanding the text.
Vectors?
Yes, that is right. PC1 describes an equal or larger share of the variance than PC2.
No, you have <= min(m,n) principal components.
I excluded this part of the text from the preparatory material, as it might be confusing. The word [projection](https://en.wikipedia.org/wiki/Projection_(mathematics)) has the meaning of a function that maps data points in one space to another space. Also, in the eigendigit example, VanderPlas only keeps 8 values per image, but does so together with their “basis” (the eigendigits), which each have the same dimension as the original images. However, given that they have a large number of images, 8 values per image + 8 eigendigits is a compression of the data.
They are given by the SVD.
The notebook contains an example of PCA on 20k dimensions. It might be that you hit an upper limit by using even more dimensions; however, I have not experienced any problems at this size.
SVD is an algorithm for performing PCA (give or take some small considerations).
A more elaborate answer:
“What is the difference between SVD and PCA? SVD gives you the whole nine-yard of diagonalizing a matrix into special matrices that are easy to manipulate and to analyze. It lay down the foundation to untangle data into independent components. PCA skips less significant components. Obviously, we can use SVD to find PCA by truncating the less important basis vectors in the original SVD matrix.”
Yes I can.
Not very much more than, e.g., the description in Kluwer et al. Hopefully this will become clear after the seminar.
The eigenpatient contains the variation within the patients, e.g. the first eigenpatient contains the most descriptive difference between the patients. So the eigenpatient is a vector containing one value per gene that gives the relative weight of that gene. Conversely, the eigengenes contains the variation within the genes, e.g. the first eigengene contains the most descriptive difference between the genes. So the eigengene is a vector containing one value per patient that gives the relative weight of that patient.
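To make this concrete, here is a sketch of where these vectors sit in the SVD output, assuming an expression matrix with genes as rows and patients as columns (as in the notebook), on made-up data:

import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(1000, 40))          # 1000 genes x 40 patients
Xc = X - X.mean(axis=1, keepdims=True)   # center each gene

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

eigenpatient_1 = U[:, 0]   # one value per gene
eigengene_1 = Vt[0, :]     # one value per patient
print(eigenpatient_1.shape, eigengene_1.shape)  # (1000,) (40,)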
Right, it might be a bit abstract at first. I will try to run some examples during the seminar, if you were not convinced by the examples in the video.
The S matrix is a diagonal matrix containing the singular values. The eigengene matrix is transposed because that is the way it comes out of an SVD.
No, there is no individual sample or gene that takes the same shape as the eigengene. They are an average behavior across all genes/samples.
It is not a requirement on the problem; instead it is a property of the analysis. The 1st PC contains the most of the variance, the 2nd PC the next-most of the variance, etc. As the PCs are ordered, each new component adds a smaller and smaller contribution to the variance explained. Hence the type of plot you see above.
The S matrix (also called the sigma matrix) in an SVD contains the singular values (whose squares are proportional to the eigenvalues of the covariance matrix). In this context, they indicate the relative importance of the PCs, i.e. how well they explain the variance within your gene expression matrix.
That is entirely possible.
The amount of explained variation is a typical means to identify which components are relevant.
No.
Yes, that is all a PCA does for you: it finds a rotation where the variance in the data is best explained in order of the principal components.

* Is it that when this is done computationally (and not by eye as in the explanations) the best way to compute this is to base it on calculations of covariance between all data points?

No, there are no separate covariance calculations involved. Just let the SVD do its job. In practice, the PCA replaces linear combinations of covarying variables with new variables.
VanderPlas means that you can make a dimensionality reduction of your problem by just selecting a subset of the principal components.
You frequently strive to reduce your problem to as few components as possible, but not fewer.
You have to choose the number of dimensions yourself.
You are free to include as many components as you want. However, they are sorted in order of the amount of explained variance. Hence, if your first two components did not explain that much variance, your third component is also not likely to explain much.
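A sketch of one common way to choose the number of components with scikit-learn, on made-up data; the 90 % threshold is an arbitrary example of the subjective call discussed above:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 50))  # 200 samples, 50 features

pca = PCA().fit(X)
# Components come sorted by explained variance; keep enough to cover 90%
cumulative = np.cumsum(pca.explained_variance_ratio_)
n = int(np.searchsorted(cumulative, 0.90)) + 1
print(n, "components cover 90% of the variance")

scikit-learn can also do this in one step via PCA(n_components=0.90, svd_solver="full").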
Any situation where you would like to explain the covariation in multi-dimensional data with a fewer number of components.
No, I do not think that is right. Neither of the methods uses regularization.
There is no really nice way of characterizing overfitting to data for unsupervised learning. However, just as for supervised learning, the more parameters you introduce, the more prone an algorithm is to fit a particular dataset. Conversely, if you select a too rigid model (with too few parameters), it will fail to generalize.
VanderPlas mentions SparsePCA and explains that it uses a regularization scheme called an L1 penalty, i.e. it penalizes not just the squared errors of the decomposition, but also the magnitudes of the elements in each principal component, which drives many of them to exactly zero. Feel free to try the method yourself in e.g. the jupyter notebook; a sketch follows below.
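A minimal sketch with scikit-learn's SparsePCA on made-up data; the alpha value (the strength of the L1 penalty) is arbitrary:

import numpy as np
from sklearn.decomposition import SparsePCA

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 30))  # 100 samples, 30 features

spca = SparsePCA(n_components=5, alpha=1.0, random_state=0)
spca.fit(X)
# The L1 penalty drives many loadings to exactly zero:
print((spca.components_ == 0).mean())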
None. Eigengenes/eigensamples is just a more direct nomenclature.
The eigenassay contains the variation within the assays (patients), e.g. the first eigenassay contains the most descriptive difference between the assays. So the eigenassay is a vector containing one value per gene that gives the relative weight of that gene.
Yes, the nomenclature in the notebook follows section 2 of Kluwer et al.
Yes, Kluwer et al. calls principal components eigengenes and eigenassays.
I disagree. VanderPlas states that: “Certainly PCA is not useful for every high-dimensional dataset, but it offers a straightforward and efficient path to gaining insight into high-dimensional data.” That is not really the same thing as you state. PCA is frequently used on high-dimensional data, as we are in more need of dimensionality reduction in high-dimensional data.
The main limitation is that PCA is a linear technique. If the association between two variables is not linear, you might not catch such associations.
Weak signals can sometimes be captured by other clustering methods as well. What is great with PCA is that it allows you to strip off one effect at a time from the data (just as Kluwer et al. demonstrate).
Each of the two sets contains patients’ gene expression values (about 20k gene expression values per patient). Here the data is given as a 20k x 500 array, with genes as rows and patients as columns. You add the patients with the two types of cancer into one large set, which here is represented as a 20k x 1k matrix. This is the typical kind of thing that is easier viewed in the notebook directly.
You conduct your experiment exactly the same way for all samples. That is easier said than done, though.
One related type of analysis, which uses a different orthogonality criterion than PCA, is Independent Component Analysis (ICA).
t-SNE is a non-linear technique. We won't cover it here, but it is great for visualisation of multidimensional data. Feel free to try it out in the notebook example; it is relatively straightforward to use, as sketched below.
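A minimal sketch with scikit-learn's TSNE on made-up data; the perplexity value is just a common default, not a recommendation:

import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(8)
X = rng.normal(size=(200, 50))  # 200 samples, 50 features

emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(emb.shape)  # (200, 2): 2-D coordinates for plotting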
Why don’t you find out by trying them out?
You! It is a subjective call.
No, I do not say that they are more relevant, just that they describe most of the variation in the data. They could contain outlier data, so it might be worth plotting their expression values.