Here we store material for the Data Analysis part of CB2030, the Systems Biology course.
This project is maintained by statisticalbiotechnology.
No, this should not, at least not in theory, be the case. The model should regress away the larger effect of the tumor grade, leaving the subsequent tests with data where the linear effect of tumor grade is absent.
The difference in KNAP2 expression between node=1 and node=0 is positive for grade=1 tumors, but negative for grade=3 tumors. Hence it is tempting to hypothesise an interaction between these variables. We subsequently test whether this interaction is significant.
Yes, the notebook, for instance, looks at KNAP2 expression’s dependency on the interaction between tumor grade and whether or not the patient’s lymph nodes have been removed.
Not correct! We use the notation C(.) when we want to emphasise that a variable is categorical and not continuous.
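As a minimal sketch of the difference (the data frame and its columns here are made up for illustration), compare how a statsmodels formula treats the same column with and without C():

```python
import pandas as pd
import statsmodels.formula.api as smf

# Made-up data: 'grade' is coded 1, 2, 3 but is a category, not a dose.
df = pd.DataFrame({"expression": [2.1, 1.8, 3.0, 3.4, 4.1, 4.3],
                   "grade": [1, 1, 2, 2, 3, 3]})

# Without C(), grade is treated as continuous: a single slope.
continuous_fit = smf.ols("expression ~ grade", df).fit()

# With C(), grade is treated as categorical: one coefficient per level.
categorical_fit = smf.ols("expression ~ C(grade)", df).fit()

print(continuous_fit.params)
print(categorical_fit.params)
```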
I think that if you use x = rng.rand(50) you will get 50 random numbers in the half-open interval 0 ≤ x < 1, while if you multiply them by 10, you will get them in the range 0 ≤ x < 10.
A2: Yes, this is a valid explanation. I get the following results when testing similar code:
```python
import numpy.random as rng
rng.rand(5)
# array([5.71024323e-01, 5.82879495e-01, 2.57541449e-01, 2.28721560e-01, 5.37681066e-04])
10*rng.rand(5)
# array([2.2329335 , 9.55115181, 5.90653729, 6.60637405, 5.64723141])
```
I always use the seaborn package; however, there are several other packages available, such as bokeh, ggplot & plotly. I think it is very important for you all to learn one such package well. If you as engineers ever get a job that does not involve displaying data for yourselves or others, you should consider whether you are the right person in the right place.
Seaborn calculates these confidence intervals by bootstrapping. They are meant as a reality check for you as a user of how stable your estimated regression lines are.
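For instance, a minimal sketch on synthetic data (all numbers made up), where seaborn draws the regression line together with its bootstrapped confidence band:

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2 * x + rng.normal(scale=3, size=50)  # a noisy linear relationship

# The shaded band around the line is a 95% confidence interval,
# estimated from n_boot bootstrap resamples of the data.
sns.regplot(x=x, y=y, ci=95, n_boot=1000)
plt.show()
```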
In my slides I show that one can use very similar-looking models for categorical and linear variables. The residuals will in both cases be the differences between the model and the measurement data.
ε = Y - f(X). Depending on which function f(X) we test, the residuals will be different.
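As a minimal numpy sketch with made-up numbers, computing the residuals ε = Y - f(X) for a fitted straight line:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])

# Fit a straight line f(X) = a*X + b by least squares ...
a, b = np.polyfit(x, y, deg=1)

# ... and compute the residuals epsilon = Y - f(X).
residuals = y - (a * x + b)
print(residuals)
```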
No, we usually model the error term as an addition to the effect of the independent variables, i.e. Y = f(X) + ε.
Yes, the assumptions on the distribution of the residuals have to be fulfilled for the tests to be valid.
I should have done a better job in explicitly stating that I do not want you to cover how F-tests are calculated, including the theory behind degrees of freedom. In this course I want you to treat them as a black box for significance testing.
df is an abbreviation of degrees of freedom. The more parameters you include in your model, the more you lower the df of the residuals. This often lowers the sensitivity of the test.
See Wikipedia.
I specifically want to avoid explaining how we form sampling distributions, so I leave this question for you to find out yourself. But in short, you use two parameters in your model, hence you are left with n-2 degrees of freedom after the fit.
Here is an explanation of a t-test.
n-2 is the degrees of freedom of the residuals. I am not covering the techniques for how significance is calculated in this course, so I leave this for your own exploration.
I should have done a better job in explicitly stating that I do not want you to cover how F-tests are calculated. In this course I want to treat them as a black box for significance testing. If you want to know how they really operate, there are several online resources to choose from, as well as most standard statistics courses, which put a lot of effort into how they are constructed. However, as said, just see them as a vehicle for the calculation of p values from your data and linear models.
Yes, ANOVAs depend on F-tests.
I select two reasons: 1) It makes the maths easier; the derivative of e(X)² is 2e(X), while absolute values are trickier. 2) We subsequently want to use F-tests to evaluate the models, which are defined for squared normally distributed parameters. By minimising the squared residuals we get the input parameters for the F-test for free.
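In symbols (a one-line sketch of the argument), for a single residual $e$:

$$\frac{\mathrm{d}}{\mathrm{d}e}\,e^2 = 2e, \qquad \frac{\mathrm{d}}{\mathrm{d}e}\,|e| = \operatorname{sign}(e),$$

where the latter is undefined at $e = 0$, which is what makes absolute values trickier to optimise.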
There is no rule. Check the properties by evaluating an F-test yourself, or do as countless students have done before you: use a table.
No, I will not explain how ANOVA’s are calculated.
No, you got the message wrong. For n > 30, a t-distribution is approximately equal to a normal distribution. If you want to learn more about sampling distributions, feel free to discover their beauty on your own.
The book actually states that the variables clearly are not uncorrelated, but that their independence is a good enough approximation.
Yes.
None. For historical reasons, we call some special cases of F-tests t-tests.
No, for most tests in the notebook more than one variable is tested. t-tests can only be performed in case-control situations, or when a single sample is to be compared to a given value.
Because you frequently want to test the difference between two groups.
The best answer is that we should only check the relations that make sense prior to the experiment.
They mean that you should evaluate an ANOVA at each step of the model building. I am a bit sceptical of these types of methods, as they indeed offer a possibility for p value hacking. You should instead just test the relations that make sense prior to the experiment.
Not really. However, all three suggested schemes are a form of data dredging, or p-value hacking.
Yes, the residuals are affected by which model we select.
I liked the textbook’s explanation: “The rationale for this principle is that if X1*X2 is related to the response, then whether or not the coefficients of X1 or X2 are exactly zero is of little interest. Also X1*X2 is typically correlated with X1 and X2, and so leaving them out tends to alter the meaning of the interaction.” Particularly the second part is easy to understand. If there is a small dependency on X1 or X2 that we do not regress away, it will be hard to tell what a significant dependency on X1*X2 means.
Not directly. If we select an overcomplicated model, we will get insignificant results, as we are decreasing the degrees of freedom of the residuals. However, it is easy to overfit by incrementally refining your model to the data.
We normally just take interest in the significance, i.e. the p value, of each parameter of the model.
There are several ways to do non-linear fits of data. However, it is often hard to evaluate the significance of such models if they are not linear. Just as you state, there are also unsupervised methods to find relationships in data that will not affect our ability to subsequently test the data, e.g. PCA and clustering. These are seldom used in this context; in my mind that is a pity.
I do not fully follow the rationale behind looking at the mean of the regression lines.
Linear regression is a type of analysis that tries to model the relationship between two (or, in multiple linear regression, more) features by fitting a linear equation to your data. Beta0 and Beta1 would in this model indeed represent the intercept and the slope.
Finding these terms means fitting a linear equation to your data, and it would result in you having estimates of Beta0 and Beta1, but I think it does not necessarily mean there is a relationship between X and Y.
You can, however, assess the accuracy of these coefficients by calculating the standard error for each of them (e.g. SE(Beta1)). You can also determine the p-value for the null hypothesis “There is no relationship between X and Y”, which can be used to infer whether there is a relationship or not, depending on the size of the p-value. A small addition to a nice answer: the word “population” is problematic in the question. With a linear model we try to establish whether there is a linear dependency between one dependent variable, say $Y$, and another independent variable, say $X$, for the individuals within the population. The formula for such a test is written “Y~X”, which translates to “Y = aX + b”.
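As a sketch of how this looks in practice (synthetic data, made-up coefficients), statsmodels reports the estimates, their standard errors and the p-values for the null hypothesis that each coefficient is zero:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
df = pd.DataFrame({"X": rng.uniform(0, 10, 50)})
df["Y"] = 1.5 + 0.8 * df["X"] + rng.normal(scale=2, size=50)

# "Y ~ X" translates to Y = Beta0 + Beta1*X + error.
fit = smf.ols("Y ~ X", df).fit()

print(fit.params)   # estimates of Beta0 (Intercept) and Beta1 (slope)
print(fit.bse)      # their standard errors, e.g. SE(Beta1)
print(fit.pvalues)  # p-values for H0: "no relationship between X and Y"
```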
In the text they use one relationship Y(X). They generate a number of different X,Y samples and calculate regression lines for all these samples.
The textbook simulates some measurements of a relationship between two variables X and Y. They add a normally distributed error in the simulation, which they call ε. The parameters of the normal distribution were selected according to the author’s idea of a suitable size for the error.
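A minimal sketch of that kind of simulation (the true coefficients and error size here are arbitrary choices, as in the book):

```python
import numpy as np

rng = np.random.default_rng(2)
slopes = []
for _ in range(100):
    # One simulated sample from the relationship Y = 2 + 3*X + eps,
    # with a normally distributed error of arbitrarily chosen size.
    x = rng.uniform(0, 10, 30)
    y = 2 + 3 * x + rng.normal(scale=5, size=30)
    slope, intercept = np.polyfit(x, y, deg=1)
    slopes.append(slope)

# The spread of the fitted slopes shows how much the regression
# lines vary from sample to sample around the true value 3.
print(np.mean(slopes), np.std(slopes))
```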
Your question is problematic. It contradicts a statement on frequentist hypothesis testing which I think I made a couple of times during the previous seminars: we never calculate any probabilities under $H_1$. This is why it is relatively easy to calculate p values.
There are several null models. One per parameter tested.
No, we normally select H0 to be that there is no dependence between the categorical variable and the population parameter we test.
This is dependent on the effect size of the phenomenon you are investigating. One often does a power analysis prior to the experiment to find out.
Yes. Try it out for yourself in the notebook by subsampling the data.
No. This depends on the effect size.
It is still reliable in the sense that we can calculate significance, however, the tests become insensitive when confronted with little data.
No, just present the p value.
None. Normally we use $p$ values.
We are minimizing the summed square of the residuals, which is a good thing if you have normally distributed residuals. That is not always the case. There are several other techniques for optimization that can be used. Still, linear models are popular as they are easy to interpret.
A description of the calculus can be found in this blog post.
The F-statistic is the statistic we use when we perform F-tests. Roughly, it is the ratio between the variance explained by the model and the variance of the residuals.
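In the common textbook form, for a regression with $p$ predictors and $n$ observations:

$$F = \frac{(\mathrm{TSS} - \mathrm{RSS})/p}{\mathrm{RSS}/(n - p - 1)},$$

where TSS is the total sum of squares and RSS the residual sum of squares.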
There are several reasons for this.
If you in 1. mean that you look at how much the slope variable explains the variance in the data, then 1. and 2. are equivalent. In the book they use permutation statistics, and under such conditions 2. is easier to simulate.
No, but the more independent variables you include, the less power you get to prove an effect of each variable. A quote from von Neumann: “With four parameters I can fit an elephant, and with five I can make him wiggle his trunk”.
The author states: “To visualize the residuals, I group respondents by age and compute percentiles in each group, as we saw in Section 7.2. Figure 10.2 shows the 25th, 50th and 75th percentiles of the residuals for each age group. The median is near zero, as expected, and the interquartile range is about 2 pounds. So if we know the mother’s age, we can guess the baby’s weight within a pound, about 50% of the time.”
Minimizing the sum of squares, any variant involving division by a constant (as in R²), or the root of their mean (as in RMSE) are equivalent procedures.
The RMSE tells you how much variance remains to be explained after we have deduced our linear model from the data points.
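A small sketch with made-up numbers showing that RSS, R² and RMSE are all computed from the same residuals:

```python
import numpy as np

y = np.array([2.0, 4.1, 5.9, 8.2])
yhat = np.array([2.1, 3.9, 6.1, 7.9])  # predictions from some fitted model

residuals = y - yhat
rss = np.sum(residuals ** 2)                # what least squares minimizes
r2 = 1 - rss / np.sum((y - y.mean()) ** 2)  # RSS divided by a constant
rmse = np.sqrt(np.mean(residuals ** 2))     # root of the mean of the squares
print(rss, r2, rmse)
```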
I think the book provides a simple strategy for non-linear relationships.
To construct a model where the dependent variable consists of boolean data, a logistic regression is used. For data where the dependent variable is an integer count, a Poisson regression is used instead. How do these differ, both in their related distributions and the ways to construct their models?
Here is a blogpost on the subject.
More or less, yes.
Could you elaborate regarding logistic regression? In 11.6 they say that it can be used for non-numerical dependent variables and use boolean as an example; are there other non-numerical dependent variables?
I think you are right in that the square of the age seems to be an arbitrary choice. He most likely just selected this as an example.
In logistic regression the dependent variable is expressed on the log-odds scale, log[p/(1-p)], which is modeled as a linear function of the independent variables. This results in p taking values between 0 and 1.
A Boolean variable can only take two values, like true or false. An integer can take any integer value; in the case of a Poisson regression, the integers are limited to positive integers and zero. An example of a Boolean variable is whether you have Covid-19; an example of a count is the number of students in this class that have Covid-19.
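As a sketch (synthetic data, arbitrary coefficients) of how the two model types are constructed with statsmodels formulas:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
df = pd.DataFrame({"exposure": rng.uniform(0, 5, 200)})

# Boolean outcome (e.g. having Covid-19 or not): logistic regression
# models the log-odds log[p/(1-p)] as a linear function of the predictors.
p = 1 / (1 + np.exp(-(-2 + 1.0 * df["exposure"])))
df["infected"] = (rng.uniform(size=200) < p).astype(int)
logit_fit = smf.logit("infected ~ exposure", df).fit()

# Count outcome (e.g. number of cases): Poisson regression models the
# log of the expected count as a linear function of the predictors.
df["cases"] = rng.poisson(np.exp(0.1 + 0.3 * df["exposure"]))
poisson_fit = smf.poisson("cases ~ exposure", df).fit()

print(logit_fit.params)
print(poisson_fit.params)
```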
The procedure in Downey simulates the sampling process by sampling with replacement. This is known as bootstrapping. We are studying the same sample values over and over again, but in a new context each time.
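A minimal numpy sketch of bootstrapping a mean (the sample values are made up):

```python
import numpy as np

rng = np.random.default_rng(4)
sample = np.array([3.2, 4.1, 5.0, 4.7, 3.9, 4.4])

# Resample the observed values with replacement, many times, and look
# at how the statistic of interest varies across the resamples.
boot_means = [rng.choice(sample, size=sample.size, replace=True).mean()
              for _ in range(1000)]
print(np.percentile(boot_means, [2.5, 97.5]))  # a 95% confidence interval
```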
You just mix up the links between the dependent and the independent variables, or, as Downey does, the link between the residuals and the independent variable.
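And a corresponding sketch of a permutation test (synthetic data), where shuffling the dependent variable breaks the link under the null model:

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.arange(20.0)
y = 0.5 * x + rng.normal(size=20)
observed_slope = np.polyfit(x, y, deg=1)[0]

# Shuffle y to break the link between dependent and independent
# variable, and refit under this null model.
null_slopes = [np.polyfit(x, rng.permutation(y), deg=1)[0]
               for _ in range(1000)]
p_value = np.mean(np.abs(null_slopes) >= abs(observed_slope))
print(p_value)
```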
General question: What constitutes good and bad practice, respectively, when it comes to resampling and other bootstrap-oriented methods?
The point that the text tries to make is that if we have different numbers of samples from a certain part of the population, we need to compensate for that effect. Say that you have a gender bias in your study; it is then good practice to compensate for the gender effect when resampling. If you want to make an early prediction of how the mail-in voters in the US elections are going to behave, it is wise to compensate for the fact that mail-in voters have different party preferences than the ones turning up at the polling booths.
“To correct for oversampling, we can use resampling; that is, we can draw samples from the survey using probabilities proportional to sampling weights” (Allen B. Downey, Chapter 10.7, Weighted resampling). Is the author referring to resampling methods such as permutation tests, the bootstrap, or the jackknife, or is he simply rerunning the test with weights applied? “If you oversample one group by a factor of 2, each person in the oversampled group would have a lower weight” (Allen B. Downey, Chapter 10.7). Can the application of weights reduce significant differences in one represented group if the weights are applied to a dataset that is not large enough? E.g. if you oversample a group by a factor of two and one of the samples does not represent the actual distribution of the factor in the group, you would falsely correct this group. On the other hand, you would give more weight to a group that is represented by just one sample, which could also be unrepresentative.
If the number of all samples is known, we can sample proportionally according to their different weights after traversing the entire sample. But how do we do weighted sampling when we don’t know how big the total sample is, or when the total is too large to traverse?
These are relative terms; if you over-sample one type of data, you under-sample another.
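A minimal sketch of weighted resampling with numpy (the values and weights are made up): observations from an oversampled group are drawn with proportionally lower probability:

```python
import numpy as np

rng = np.random.default_rng(6)
values = np.array([1.0, 2.0, 3.0, 4.0])

# Suppose the first two observations come from a group that was
# oversampled by a factor of 2: give them half the weight.
weights = np.array([0.5, 0.5, 1.0, 1.0])
probs = weights / weights.sum()  # probabilities proportional to the weights

# Draw a weighted bootstrap sample (with replacement).
resample = rng.choice(values, size=values.size, replace=True, p=probs)
print(resample)
```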
This is what we call the output from a model, e.g. yhat in yhat = a*x + b.
Other names for the dependent variable.
This is what we call the input to a model, e.g. x in yhat = a*x + b.
Other names for independent variables.
Variables used as stand-ins for other (often unmeasurable) variables, e.g. BMI for the amount of body fat, or GDP for the welfare of a country.
A variable included in a regression to eliminate or “control for” a spurious relationship, i.e. a variable you are not interested in modelling per se, but whose effect you want to remove. E.g. which weekday you did the measurement, or which of two 48-well plates your sample was residing on.
This is unfortunately not covered well by the material in your preparatory reading; sorry about that. Here is a general description of patsy notation. x:z translates into a product between the two terms x and z, i.e. yhat = a*x*z + b.
A residual is the difference between the estimated and observed variable, so the sum of the squared residuals is $\sum_i (y_i - \hat{y}_i)^2$.
It just means that the formula continues on the next line.
The criterion should be that they are significant.
One is the square of the other. A typical case where they differ is when r is negative; then r² is still positive.
Prioritize p.
There are automatic outlier tests. However, I recommend that you instead manually inspect the residuals of your model.
An interaction is any effect that relates to a combination of two variables. For instance, if you want to test whether two different drugs work better in combination than individually to reduce the weight of patients, you might want to test Weight_reduction ~ C(Drug1) + C(Drug2) + C(Drug1):C(Drug2), as in the sketch below.
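A sketch of fitting such a model (the data frame is hypothetical):

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical 2x2 drug-trial data.
df = pd.DataFrame({
    "Weight_reduction": [1.0, 1.2, 0.9, 1.1, 1.3, 1.0, 3.5, 3.8],
    "Drug1": [0, 0, 1, 1, 0, 0, 1, 1],
    "Drug2": [0, 0, 0, 0, 1, 1, 1, 1],
})

# Main effects plus the interaction term C(Drug1):C(Drug2).
fit = smf.ols("Weight_reduction ~ C(Drug1) + C(Drug2) + C(Drug1):C(Drug2)",
              df).fit()
print(sm.stats.anova_lm(fit))  # ANOVA table with an F-test per term
```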
An ANOVA is a statistical test you perform on linear regression models.
It’s in their nature that the underlying effect is hard to measure. We select them because we can’t measure the real deal.