Here we store material for the Data Analysis part of the CB2030, Systems Biology course.
This project is maintained by statisticalbiotechnology
I do not fully get the “dynamic nature” of pathways, but your first point resonates well with me. I would add the method’s inability to detect covariation between the gene products as an additional weakness.
Yes, that is important, but detecting novel pathways is not a goal of regular pathway analysis.
No, there are differences between the databases. Even the number of pathways in the database will affect your results, if you do your multiple hypothesis correction in an accurate way.
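As a minimal sketch of why the database size matters, the example below applies a Benjamini-Hochberg correction to the same three p values inside a small and a large pathway database. The p values and database sizes are made up for illustration, and the `benjamini_hochberg` helper is a plain-Python stand-in rather than the code used in the course notebooks.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values (q values) in input order."""
    m = len(p_values)
    # Indices of the p values, sorted from smallest to largest p
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotone q values
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


# Hypothetical p values for three interesting pathways
hits = [1e-4, 2e-3, 1e-2]

# The same three hits in a small and in a large database.
# All other pathways are assumed to be uninteresting (p = 1.0).
small_db = hits + [1.0] * 47    # 50 pathways in total
large_db = hits + [1.0] * 997   # 1000 pathways in total

q_small = benjamini_hochberg(small_db)[:3]
q_large = benjamini_hochberg(large_db)[:3]
print("q values, 50 pathways:  ", q_small)
print("q values, 1000 pathways:", q_large)
```

With identical p values, every q value is larger in the larger database, so a pathway that passes a given threshold in one database may not pass it in another.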
This is an open research topic within multiomics analysis. It should be entirely possible to do so; however, I have seen very few successful implementations. One problem is the lack of pathway definitions. However, some pathway definition databases, like KEGG and Reactome, actually do contain information on both metabolites and proteins.
It is not so much that the larger pathways get higher significance by chance, as that they have a larger chance to pick up weak signals.
It is their way of saying that you are giving a common measurement of gene expression for the genes in the pathway instead of measuring the activity of individual genes. As explained in the video lecture, you count the genes in the intersection between the differentially expressed genes and the pathway. That number is the single pathway-level statistic.
Are these two alternatives mutually exclusive? Preferably according to both criteria.
The first statement just comes from combinatorics. And you do not get to “choose” which proteins were measured; that is a given from the data.
All the genes that you were taking into consideration in the first place, i.e. all genes you have measured. The two other choices you list will lead to biases.
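The counting described in the answers above can be sketched as a one-sided hypergeometric test, where the background is the set of all measured genes. This is only an illustration under assumptions: the gene names and set sizes are invented, and `ora_p_value` is a hypothetical helper, not a function from any package.

```python
from math import comb


def ora_p_value(background, de_genes, pathway):
    """One-sided hypergeometric p value for over-representation."""
    background = set(background)
    # Restrict both gene lists to the measured background
    de = set(de_genes) & background
    pw = set(pathway) & background
    n_background = len(background)   # all measured genes
    n_pathway = len(pw)              # measured genes in the pathway
    n_de = len(de)                   # differentially expressed genes
    k = len(de & pw)                 # pathway-level statistic: the intersection
    # P(X >= k) when drawing n_de genes without replacement
    tail = sum(
        comb(n_pathway, i) * comb(n_background - n_pathway, n_de - i)
        for i in range(k, min(n_pathway, n_de) + 1)
    )
    return k, tail / comb(n_background, n_de)


# Hypothetical toy data: 20 measured genes, 5 in the pathway, 5 called DE
measured = [f"g{i}" for i in range(20)]
pathway = ["g0", "g1", "g2", "g3", "g4"]
de_genes = ["g0", "g1", "g2", "g3", "g10"]

k, p = ora_p_value(measured, de_genes, pathway)
print(k, p)

# Using a background that is larger than what was measured
# (here 200 genes) makes the same overlap look more significant.
inflated = [f"g{i}" for i in range(200)]
_, p_inflated = ora_p_value(inflated, de_genes, pathway)
print(p_inflated)
```

The second call illustrates the bias mentioned above: padding the background with genes that were never measured shrinks the p value without any change in the data.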
No, as they are the same test.
I plotted the distributions of L and S for a fictitious pathway in my slide. If the distributions are significantly different, we call that a regulation of the pathway. That is also what Subramanian et al. are measuring.
There are some efforts to handle this. For instance, at least one study uses the absolute value of the differences between case and control rather than the difference itself. However, I do not think that method is frequently used. GSEA itself solves the problem by having two different tests, one for up-regulation and one for down-regulation. The problem, however, is that we are not making use of the most striking feature of this analysis, the co-variation between the gene products we are studying. We are currently developing methodology for this in my lab.
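A toy illustration of the absolute-value idea, with made-up numbers: when half of a pathway's genes go up and half go down, a signed average cancels out, while an average of absolute differences does not. The averages are a simplified stand-in for a pathway score, not the statistic used in any particular study.

```python
# Hypothetical case-minus-control differences for the genes of one pathway,
# where half of the genes go up and half go down
diffs = [2.0, -2.0, 1.5, -1.5]

# Signed average: the two directions cancel each other
signed_score = sum(diffs) / len(diffs)

# Average of absolute values: the direction is ignored
absolute_score = sum(abs(d) for d in diffs) / len(diffs)

print(signed_score, absolute_score)
```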
Yes, it means that the downward trend is stronger than the upward trend in that part of the plot. However, GSEA cares about the “leading edge”, i.e. the leftmost part of the plot.
Could you help me by explaining how their proposed method would not “determine whether members of a gene set S tend to occur toward the top (or bottom) of the list L”?
No, the calculations should be quick. However, the gseapy package does not execute its code locally, but instead submits each gene set as a job to the GSEA or Enrichr servers. This slows down the execution.
Yes! Or at least that this implementation of pathway analysis does not pick up much signal.
It all depends on what you are interested in reporting as a result.
Yes, your question is an accurate description of how a permutation test works.
I think they mean that if they exchanged individual gene+sample pairs with each other, they would get an expression matrix that would be too easy to distinguish from a real matrix. It should be noted that [permutation tests](https://en.wikipedia.org/wiki/Resampling_(statistics)) are seen as a gold standard for nonparametric testing.
The p value for differential expression is a suitable sorting score. In practice, the input to the p value calculation, the t statistic, is frequently used. However, this might not be what you are asking for (despite the wording of your question). The GSEA method includes a score for different rankings (which is shown in the lower subplots in the notebook). That score is part of the secret sauce of how they score leading edge genes. If you want the arithmetical details, they are given in Equation [1] of the Subramanian et al. paper.
Under the null hypothesis, the pathway’s genes should be randomly ranked. Under that condition, the up- or down-regulation, i.e. the running ES, should follow a random walk process. Hence, if the path seems extreme, it is a sign of enrichment.
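The walk can be sketched as a running sum over the ranked list: step up when a gene belongs to the pathway, step down otherwise, and report the largest deviation from zero. The code below is a simplified reading of the weighted running sum in Subramanian et al., not the gseapy implementation, and the gene names, scores, and the `enrichment_score` helper are invented for illustration.

```python
def enrichment_score(ranked_genes, scores, gene_set, p=1.0):
    """Running-sum enrichment score over a ranked gene list.

    ranked_genes: gene names sorted by decreasing score
    scores: the ranking metric for each gene, in the same order
    gene_set: the genes of the pathway
    p: weight exponent; p=0 gives an unweighted walk
    """
    gene_set = set(gene_set)
    hits = [g in gene_set for g in ranked_genes]
    n_miss = len(ranked_genes) - sum(hits)
    # Normalisation so that all upward steps sum to one
    norm = sum(abs(s) ** p for s, h in zip(scores, hits) if h)
    running, walk = 0.0, []
    for s, h in zip(scores, hits):
        if h:
            running += abs(s) ** p / norm   # step up at a pathway gene
        else:
            running -= 1.0 / n_miss         # step down at any other gene
        walk.append(running)
    # The ES is the largest deviation from zero, keeping its sign
    es = max(walk, key=abs)
    return es, walk


# Hypothetical ranked list of ten genes and a three-gene pathway
genes = [f"g{i}" for i in range(1, 11)]
scores = [5, 4, 3, 2, 1, -1, -2, -3, -4, -5]
pathway = {"g1", "g2", "g4"}

es, walk = enrichment_score(genes, scores, pathway)
print(round(es, 3))
print([round(w, 3) for w in walk])
```

Because the upward and downward steps each sum to one, the walk always returns to zero at the end of the list. What matters is how far it strays on the way.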
“phenotype label is permuted” => permute the relation of case and control.
Yes.
It is mostly used because it is easy to calculate. It also does not require quantitative measurements.
You might not have any quantitative data. E.g. you have used an antibody array to detect the presence of a set of proteins.
One major advantage of ORA is that the method does not need expression values. I.e. if you have identified a set of proteins using antibodies, you just know whether they were present in the sample or not, and you would not know the proteins’ concentrations. Then you can apply ORA but not GSEA. One usually prefers GSEA when one has expression values, and ORA when one for some reason does not.
Both analyses are frequently used.
The analysis is completely dependent on the curation of pathway databases. However, for unannotated species, assignments to pathways can often be done by searching for orthologs in organisms included in pathway databases.
It is challenging to annotate data. In general, this is done as community efforts by various underfunded angels across the world.
You should mine for pathways that you are interested in. Other than that, I do not have any recommendations in this matter.
Not the way we traditionally do pathway analysis. There is, for instance, no mechanism that subtracts the effects of significant sub-pathways from an overarching pathway (if you for some reason would like to do so).
This is a real problem. Several efforts use network inference from PPI, co-expression or other networks to add external information to pathway analysis. A close-by example is this article.
Many pathway databases, such as Reactome, are hierarchical. In such definitions, sub-pathways are encompassed by larger pathways. E.g. the pathway “Phosphorylation of STAT2” is a part of the larger pathway “Immune System”.
Automatic annotation does not work well. The state of the art is manual annotation from the literature, possibly with the aid of algorithms.
| # | p | Overlap | Term | q |
|---|---|---|---|---|
| 0 | 6.142575e-11 | 24/124 | Cell cycle | 1.554448e-08 |
| 1 | 1.631457e-04 | 17/160 | Cellular senescence | 2.064293e-02 |
| 2 | 3.322946e-04 | 7/35 | Alanine, aspartate and glutamate metabolism | 2.803030e-02 |
Does it mean that since the cell cycle’s p-value < threshold, those 24 overlapping genes that are differentially expressed are just relevant for the pathway per se? Or does it have another meaning? Could you please explain this part more thoroughly?
24/124 means that there were 124 genes in the “cell cycle” pathway, whereof 24 genes were differentially expressed at q < 10^-15.
No, there is no rule for this. However, having taken this course, you will of course recognize that it should never be set based on measures that have not been corrected for multiple hypothesis testing.
The documentation lists a couple of methods. Some of them need sample groups of at least 3, which might not be available. There might also be cases where you would rather sort by fold change (FC) than by significance. See above.
The advantage is that unlike ORA, GSEA studies the full distribution of expression values.
This is a nice observation. It is more common to use categorical variables, but there is nothing that hinders the same analysis from being done for continuous variables. gseapy does not seem to support this, though.
The documentation lists a couple of methods. Some of them need sample groups of at least 3, which might not be available. There might also be cases where you would rather sort by fold change (FC) than by significance. See above.
I had hoped that this would be a slam dunk for you who have read Downey, and been diving further into the world of permutation tests than most. It is easy: you permute the labels and get new differential expression values, for which you can calculate a new enrichment score. If you do this enough times, you can assess how unlikely your unpermuted outcome would be under random labeling.
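The procedure can be sketched in a few lines. To keep the example short, the pathway score here is just the mean case-minus-control difference over the pathway's genes, a deliberately simple stand-in for the GSEA enrichment score. The expression values, gene names, and both helper functions are invented for illustration.

```python
import random


def pathway_score(expr, labels, pathway):
    """Mean case-minus-control difference over the pathway's genes.

    A simple stand-in for an enrichment score.
    """
    diffs = []
    for gene in pathway:
        case = [x for x, lab in zip(expr[gene], labels) if lab == "case"]
        ctrl = [x for x, lab in zip(expr[gene], labels) if lab == "control"]
        diffs.append(sum(case) / len(case) - sum(ctrl) / len(ctrl))
    return sum(diffs) / len(diffs)


def permutation_p_value(expr, labels, pathway, n_perm=1000, seed=0):
    """Two-sided p value from permuting the sample labels."""
    rng = random.Random(seed)
    observed = pathway_score(expr, labels, pathway)
    shuffled = list(labels)
    n_extreme = 0
    for _ in range(n_perm):
        # Permute the labels only: each sample keeps all its gene values,
        # so the gene-gene correlations in the data are preserved
        rng.shuffle(shuffled)
        if abs(pathway_score(expr, shuffled, pathway)) >= abs(observed):
            n_extreme += 1
    # Add one to numerator and denominator so the p value is never zero
    return observed, (n_extreme + 1) / (n_perm + 1)


# Hypothetical expression values: four cases followed by four controls
labels = ["case"] * 4 + ["control"] * 4
expr = {
    "geneA": [5.1, 4.8, 5.3, 5.0, 3.0, 3.2, 2.9, 3.1],
    "geneB": [6.0, 6.2, 5.9, 6.1, 4.1, 4.0, 3.9, 4.2],
    "geneC": [2.0, 2.1, 1.9, 2.0, 2.0, 1.9, 2.1, 2.0],
}

observed, p = permutation_p_value(expr, labels, ["geneA", "geneB"])
print(observed, p)
```

The fraction of permutations that give a score at least as extreme as the unpermuted one is the estimated p value. With only eight samples there are few distinct label assignments, which limits how small that p value can get.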
Under the null, the distribution of genes in the pathway is random with respect to the other measured genes. Processes involving random discrete steps are frequently called random walks.
The first, i.e. that the reported statistic is inflated as it does not take gene-gene correlations into account.
Protein concentrations can be seen as gene expression. But yes, you quite frequently see pathway analysis based on protein abundances. Maybe even more frequently, pathways are used within metabolomics, i.e. with the abundances of various small molecules.
Yes, the interpretation of this score will be challenging unless you read the GSEA paper. See e.g. Figure 1.
In single-cell data and bulk data you are measuring a similar number of mRNAs or pathways. So, the data is not very different in this respect. However, in single cell, we are often less prone to look at e.g. differential expression. Instead, one is more prone to use e.g. clustering to find various cell types or similar.