Here we store material for the Data Analysis part of the CB2030, Systems Biology course.
This project is maintained by statisticalbiotechnology
To my understanding, the main advantage of unsupervised and semi-supervised learning is that these methods are less biased than supervised learning, since they look for patterns inherent to the dataset rather than patterns defined by labels. The goal of supervised learning is to predict the output for new data labelled in the same way as the training dataset, while the goal of unsupervised learning is to find patterns in unlabeled input data. Unsupervised learning is very useful for exploratory analysis, for example when clustering data points. The main disadvantage of supervised learning is the high risk of overfitting the data, which would lead to disproportionate conclusions. On the other hand, unsupervised learning can produce weak correlations or patterns between data points, which would not make for a very useful analysis.
No, you are right. A ROC score of 0.51 will often not be very valuable. I meant to say that a ROC score > 0.5 is a score that is better than a random prediction. Also, this is somewhat dependent on the research question: sometimes one accepts predictors with a very low ROC score when other information is unavailable.
For no other reason than that I wanted the abbreviations and the named metrics as separate metrics.
How to perform normalization in supervised learning? Frequently, one standardizes each feature before training and testing.
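A minimal sketch of such standardization, assuming scikit-learn and synthetic data: the scaler is fit on the training data only and then applied to both sets, so no test-set information leaks into training.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(5.0, 2.0, (80, 3))
X_test = rng.normal(5.0, 2.0, (20, 3))

scaler = StandardScaler().fit(X_train)   # learn mean and std from training data only
X_train_std = scaler.transform(X_train)  # each feature now has mean ~0, std ~1
X_test_std = scaler.transform(X_test)    # reuse the training statistics on test data
```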
I understood that PCA was used as data preprocessing in the image recognition example in order to extract features. Is this type of preprocessing a common practice for other types of data, or does it apply only to image identification?
It is not uncommon to use PCA as a dimensionality reduction technique prior to classification.
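As an illustration, a sketch of PCA as a preprocessing step before an SVM, assuming scikit-learn; the choice of 10 components is arbitrary.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
# Reduce 64 pixel features to 10 principal components, then classify.
clf = make_pipeline(PCA(n_components=10), SVC(kernel="rbf"))
print(cross_val_score(clf, X, y, cv=5).mean())
```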
Yes, there indeed are. Your finite number of data points will spread very thin in a high-dimensional feature space. This is true for any machine learning problem.
Yes, the curse of dimensionality is just a description of a problem; the name offers no solution in itself.
You can always redesign your problem so that it becomes a lower-dimensional problem.
There is no rule. I often use 70/30.
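For reference, a minimal example of such a 70/30 split, assuming scikit-learn:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
# Hold out 30% of the data for testing; the ratio is a convention, not a rule.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
```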
When you overfit, you learn how to make accurate predictions on the dataset you trained on, but you are unable to make accurate predictions on novel data.
No, you can easily think of a scenario where you overfit in one dimension. Say that you collect two patients: one with lung cancer who does not smoke, and another one without lung cancer who smokes 30 cigarettes a day. A lung cancer classifier that takes the number of cigarettes smoked per day as input, with a decision boundary of 15 cigarettes a day, would perfectly separate the training data, but might not be very valuable in the clinic.
More training data, less noisy data, less complex architectures (e.g. simpler kernels), and more regularization of the classifier reduce overfitting problems.
This situation should be detected in the testing stage of the classifier.
A larger dataset reduces the problem of overfitting, but there is no guarantee it will remove the problem.
The notebook provides such an example.
Often one uses nested cross-validation to select suitable hyperparameters, and then uses the optimal hyperparameters to train a classifier within a regular cross-validation loop. That way, the best hyperparameters are found for each cross-validation fold independently.
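A sketch of the idea, assuming scikit-learn: the inner GridSearchCV selects C independently for each outer fold, and the outer loop reports the performance.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
# Inner loop: 3-fold grid search over the slack penalty C.
inner = GridSearchCV(SVC(kernel="linear"), {"C": [0.01, 0.1, 1, 10]}, cv=3)
# Outer loop: 5-fold cross-validation around the whole search.
outer_scores = cross_val_score(inner, X, y, cv=5)
print(outer_scores.mean())
```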
No, the main idea is that we avoid overfitting, or at least are able to detect situations where we overfit, by using cross-validation.
It depends on the problem; however, the idea is that we want to detect situations where we overfit.
To some extent, there is a withholding of data when you cross-validate. Also, it is easier to extend your predictions to new data points: i.e., which of your k trained predictors from k-fold cross-validation should you use when you encounter a new example?
There are different strategies for this. Often a final classifier is trained on all the data. However, such a classifier should not be used for reporting performance.
So that none of the learners trains and tests on the same data.
Yes!
Kernel functions could cover a full course of their own. However, here we stop at the fact that they are ways to introduce new dimensions to your data that make it possible to divide the data into different classes.
Yes, there are several different kernels available. Perhaps the most well-known kernel is the RBF kernel.
A well-known case in bioinformatics is so-called string kernels, which evaluate similarities of text or amino acid strings. Another example, from phylogeny, is tree kernels.
Gamma is a parameter that controls the width of an RBF kernel, K(x,x') = exp(-γ||x-x'||²): the larger the gamma, the narrower the kernel.
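As a small illustration of gamma's effect, a sketch assuming scikit-learn and a synthetic dataset; a larger gamma gives a more flexible, more overfit-prone boundary.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
# Very small gamma underfits, very large gamma overfits; cross-validated
# accuracy tends to peak somewhere in between.
for gamma in [0.01, 1.0, 100.0]:
    score = cross_val_score(SVC(kernel="rbf", gamma=gamma), X, y, cv=5).mean()
    print(f"gamma={gamma}: accuracy={score:.2f}")
```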
Sometimes one can select a kernel based on what makes sense for the problem. In the example on slide 7, we would perhaps know that an absolute value was more suitable for classification, and would hence select a quadratic kernel.
There are several other supervised machine learning techniques available, such as Artificial Neural Networks, Naive Bayes, Random Forest, etc.
Instead of using one kernel, we use another kernel.
Yes, the choice of kernel can be tuned just the same way as we tune hyperparameters.
You multiply the features together and use these products as additional features. The point is that this is exactly the operation a kernel performs for you, implicitly.
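To make this concrete, a sketch assuming scikit-learn, contrasting explicit feature products with a polynomial kernel that computes (roughly) the same expansion implicitly:

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.5, noise=0.1, random_state=0)

# Explicit route: compute degree-2 products as new features, then a linear SVM.
explicit = make_pipeline(PolynomialFeatures(degree=2), SVC(kernel="linear"))
# Implicit route: let a degree-2 polynomial kernel do the same work.
implicit = SVC(kernel="poly", degree=2, coef0=1)

print(cross_val_score(explicit, X, y, cv=5).mean())
print(cross_val_score(implicit, X, y, cv=5).mean())
```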
Any function you like.
Here is Wikipedia's entry on the subject. However, that is not what I think you wonder about. The expression y_i(wx_i-b) only makes sense when y_i is either -1 or 1, and you try to make wx_i-b take values > 1 or < -1 depending on the value of y_i.
You don't choose how many data points cross the boundary; you just choose how much you will penalize the ones that do cross the soft margins.
The SVM is just an optimization of a function. The objective function allows data points to cross the hyperplane, but with a penalty (through a hinge loss function).
The SVM optimizes a function with two different components: ||w|| is the inverse of the size of the margin, and Σmax(0, 1-y_i(wx_i-b)) is the slack penalty. The training of the SVM involves minimizing the sum of the two. We weight the relationship between the two terms, either by C or λ.
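A minimal numpy sketch of this objective; the data, weights, and the value of λ are made up for illustration.

```python
import numpy as np

w, b, lam = np.array([1.0, -2.0]), 0.5, 0.1   # hypothetical weights, bias, lambda
X = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, -1.0]])
y = np.array([1, -1, 1])                      # labels must be -1 or +1

margins = y * (X @ w - b)                     # y_i (w . x_i - b)
slack = np.maximum(0.0, 1.0 - margins).sum()  # sum_i max(0, 1 - y_i(w.x_i - b))
objective = lam * np.dot(w, w) + slack        # margin term + slack penalty
print(objective)                              # what SVM training minimizes
```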
I find this Stack Exchange post helpful. More generally, distances between hyperplanes can be calculated by formulas from this Wikipedia entry on the subject.
Yes, it is always a hyperplane.
The slack penalty, C = 1/λ, is a hyperparameter. The notebook contains an example of how one selects hyperparameters with grid search. And yes, there might be other hyperparameters in need of optimization during training.
We want a hyperplane that separates the AML and ALL patients; the margin would sit in between the AML and ALL patient samples.
Such data points are normally obvious, and might not affect the classifier's decision.
SVMs are normally only used for labeled datasets. There are some interesting exceptions, like outlier detection with one-class SVMs. Also, kernels by themselves have some interesting applications in unsupervised learning, for instance in visualization techniques such as UMAP.
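A small sketch of the one-class SVM exception, assuming scikit-learn's OneClassSVM and synthetic data: it is trained without labels and flags outliers.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(1)
X_train = rng.normal(0.0, 1.0, (200, 2))      # unlabeled "normal" data
X_new = np.array([[0.1, -0.2], [6.0, 6.0]])   # one inlier, one clear outlier

detector = OneClassSVM(kernel="rbf", nu=0.05).fit(X_train)
print(detector.predict(X_new))                # +1 = inlier, -1 = outlier
```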
The soft margin is a feature of the SVM, not the kernel.
- That will be dependent on the shape of the data, but tends to increase with the dimensionality of the data and the slack penalty. See e.g. this Stack Exchange post. The minimum number of support vectors needed tends to be independent of the number of dimensions of the data.
- Support vectors are defined as the data points closest to the decision boundary. Each data point is a vector, and those closest span the margin and are the only ones needed to define the decision boundary; I guess they are the ones supporting it.
Yes and Yes!
Yes, manually annotated labels are frequently set incorrectly, and some ML algorithms can to some extent overcome such problems. However, it is tricky to prove that you improve on manual annotation, since you do not have any ground truth.
Supervised ML is preferred when you trust your labels, and often for lower dimensional problems. Unsupervised ML is preferred in other cases.
Yes and Yes! In unsupervised learning there are no erroneous annotations to ruin your classification.
No, it depends on the problem. A bit circular: you should include as many features as needed, but not more. There is some theory associated with the number of selected features, like Akaike's Information Criterion; however, it is not always applicable.
When designing a classifier, you need to make sure that your training data is representative of what you actually want to classify. SVMs are no different in this regard than any other classifier.
You know that you have a good training set when you get good performance on a classifier trained on the set. However, if you get low performance, it could be an indication that the design of your classifier and not the training set is to blame.
We can set hyper-parameters based on training data, but we might overfit.
We measure performance based on test sets?
Yes, a classifier’s performance is based on its ability to generalize from training data.
That your classifier should work well.
Not sure that I understand what you mean with overlap in this context.
No, that is the easy case. We introduce soft margins to take care of less clear or outlier data points.
The derivation of the mathematical formulation of a support vector machine is given e.g. in the Wikipedia entry on SVMs.
I am currently working with datasets of 25 million data points. Given that you use a not-too-advanced classifier, you can expand the number of data points pretty far.
Yes, this is one of the points of a SVM.
Use a linear classifier?
Here is an example of one such implementation in sklearn.
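For illustration, a sketch assuming the implementation referred to is along the lines of scikit-learn's SGDClassifier, which with a hinge loss trains a linear SVM incrementally and scales to very large datasets:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=100_000, n_features=20, random_state=0)
# Hinge loss makes this a linear SVM fit by stochastic gradient descent.
clf = SGDClassifier(loss="hinge", alpha=1e-4).fit(X, y)
print(clf.score(X, y))
```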
We could either move our classification boundary in the SVM, or we could use different soft margins for our positive and negative examples during training.
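The second option can be sketched with scikit-learn's class_weight parameter (an assumption about the implementation used), which scales the slack penalty C differently for the two classes:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=0)
# Penalize slack on the rare positive class ten times harder than the negative.
clf = SVC(kernel="linear", C=1.0, class_weight={0: 1, 1: 10}).fit(X, y)
```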
A hyperplane is easy to express. A kernel warps the feature space so it becomes separable by the (linear) hyperplane, instead of what you are suggesting, i.e. warping the decision boundary.
There are many other methods.
Yes, that is it. Trial and error is the most frequent way to select models and hyperparameters in other types of ML as well.
By introducing new dimensions. A nice example of this is the quadratic dimension in Figure 1j.
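A minimal numpy sketch of that idea: one-dimensional data that is not linearly separable becomes separable once the quadratic dimension x² is added.

```python
import numpy as np

x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
y = np.where(np.abs(x) > 1, 1, -1)       # class depends on |x|: not separable in x

X_mapped = np.column_stack([x, x ** 2])  # add the quadratic dimension
# In (x, x^2) space the linear threshold x^2 = 1 separates the classes perfectly.
print(X_mapped[y == 1, 1] > 1)           # all True
print(X_mapped[y == -1, 1] > 1)          # all False
```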
Wikipedia mentions a couple of kernels.
Yes, under such conditions we need more advanced ML methods, e.g. an SVM with a more advanced kernel.
There are many different kernel functions available.
We do not know, we have to try for ourselves.
There is a description of a multi-class classification problem in VanderPlas.
Here are some examples from sklearn
Multi-class SVMs could work.
Multi-class classifiers might help you.
Yes, you use a multi-class classifier.
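As a brief illustration of the multi-class case, a sketch assuming scikit-learn's SVC, which handles more than two classes out of the box via a one-vs-one scheme:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)    # three classes of iris flowers
print(cross_val_score(SVC(kernel="rbf"), X, y, cv=5).mean())
```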
Again, trial and error is the most common method. You often do your feature selection on a separate dataset, to avoid overfitting. This said, feature selection is a research topic of its own.
Not in a visual way. You have to use your imagination, or possibly use a dimensionality reduction scheme to project your data onto a 2D plot.
No, a hyperparameter is a variable that controls the training procedure itself, and needs to be set for the procedure to work. The soft margin in SVM training is such a parameter.
It's a choice between speed and the risk of not detecting overfits. However, in practice a low number like 3 or 5 works well for many problems.
There is no fixed scheme for this. However, I normally combine the classifiers' test results based on their SVM scores, or just by their classification results, as in the notebook.
I normally set C with cross-validation or nested cross-validation, as shown in the notebook.
As long as we test on data separate from our training data, this should be detectable.
Cross-validation allows you to test which values work best on held-out data.
No, it is a way to validate supervised ML.
Yes, to some extent scientists rely on different heuristics for this.
Yes, there are classifiers using such methods, e.g. Logistic Regression
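Assuming the question concerns classifiers that output probabilities, a minimal sketch with scikit-learn's LogisticRegression:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
print(clf.predict_proba(X_test[:3]))   # class probabilities for the first samples
```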