Algorithms in Data Science are often described as Supervised or Unsupervised. Semi-Supervised learning is also becoming increasingly popular.
Imagine we have a population of dogs, and we’ve measured particular features of a dog, such as height, ear size, muzzle length, fluffiness and so on.
Say we are interested in how the breed of a dog relates to these features, as well as how long each dog is likely to live.
Prediction of a category – “breed”
In supervised learning, we already had a dog expert go through and determine the breed of each dog, and store that against each dog record or ‘sample’. This is the supervision part – an expert has done what is termed ‘class labelling’ in Data Science, where you assign a sample to a particular class (in this case breed).
Say we then get a new dog, and we don’t have the expert with us. We can use a supervised learning algorithm, such as a Decision Tree Classifier, to create a model that predicts what breed this new dog is, by using the data from existing known dogs. This way, the dog expert no longer needs to classify dog breeds for you, and can spend more time patting dogs. Win win!
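As a rough illustration of the idea above, here is a minimal sketch using scikit-learn's DecisionTreeClassifier. The measurements, the two breeds and the variable names are all made up for the example; a real dataset would have many more dogs and breeds.

```python
from sklearn.tree import DecisionTreeClassifier

# Made-up measurements for six dogs the expert has already labelled:
# [height_cm, ear_size_cm, muzzle_length_cm, fluffiness_0_to_10]
features = [
    [25, 6, 4, 3],
    [23, 5, 4, 2],
    [27, 6, 5, 3],
    [60, 12, 11, 8],
    [62, 13, 12, 9],
    [58, 12, 10, 8],
]
# The expert's class labels, one per dog (the "supervision").
breeds = ["pug", "pug", "pug", "samoyed", "samoyed", "samoyed"]

# Create the model from the known dogs.
model = DecisionTreeClassifier(random_state=0)
model.fit(features, breeds)

# A new dog arrives and the expert is busy patting other dogs.
new_dog = [[61, 12, 11, 9]]
predicted_breed = model.predict(new_dog)[0]
print(predicted_breed)
```

The new dog's measurements sit well inside the range of the larger breed, so the model should assign it to that group.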
Prediction of a numeric field – “life expectancy”
Imagine the data is from years ago, and all the dogs have now sadly crossed the rainbow bridge. We obtain the age each dog lived to, and add it to each dog’s record as the value we want to predict. We can use a supervised learning algorithm, such as Multiple Linear Regression, to predict the life expectancy of a new dog.
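A small sketch of this, assuming scikit-learn's LinearRegression and entirely synthetic data. The relationship between height and lifespan below is invented purely so the example has something to learn; it is not a claim about real dogs.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_dogs = 50

# Synthetic features for 50 dogs.
height = rng.uniform(20, 80, n_dogs)        # cm
ear_size = rng.uniform(4, 15, n_dogs)       # cm
muzzle_length = rng.uniform(3, 14, n_dogs)  # cm
fluffiness = rng.uniform(0, 10, n_dogs)     # arbitrary 0-10 scale
features = np.column_stack([height, ear_size, muzzle_length, fluffiness])

# Invented target: the age each dog lived to, with some random noise.
age_at_death = 16 - 0.08 * height + rng.normal(0, 0.5, n_dogs)

# The target is numeric, so this is regression rather than classification.
model = LinearRegression().fit(features, age_at_death)

new_dog = np.array([[50.0, 10.0, 8.0, 5.0]])
predicted_age = float(model.predict(new_dog)[0])
print(round(predicted_age, 1))
```

With the invented formula above, a dog 50 cm tall would be expected to live to about 12, so the prediction should land close to that.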
Both algorithms used to create the original predictive model are supervised learning algorithms, because we know the outcome, or “target”, that we are looking for – in this case breed, or life expectancy – and that “target” is used to create the model.
Say we want to divide the population of dogs into groups of similar dogs, but we don’t necessarily have an idea of what those groups are. They may be related to breed, they may not.
Unsupervised learning allows us to do this. We just give the algorithm the list of features and, depending on the parameters we set, it divides the dogs into groups with similar features.
Clustering algorithms such as K-means, K-medoids and DBSCAN are examples of unsupervised algorithms. The output they provide is simply the dogs split into different groups, with no indication or ‘class label’ of what those groups are, only that they are different from each other.
Some algorithms have input parameters, such as the number of clusters. So say you suspect there are 5 different breeds in your population, you could put in 5 as the number of clusters for K-means, and the algorithm may be able to separate the dogs into 5 groups representing each breed. The algorithm hasn’t been told what the breeds are, hence the algorithm is ‘unsupervised’.
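To make the "5 clusters" example concrete, here is a minimal sketch with scikit-learn's KMeans. The dogs are synthetic: five made-up groups described by two features (height and fluffiness), and the group centres are assumptions chosen so the groups are easy to separate.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Five made-up groups of 20 dogs each: [height_cm, fluffiness_0_to_10].
centres = np.array([[25, 2], [40, 8], [55, 3], [70, 9], [85, 5]], dtype=float)
features = np.vstack([c + rng.normal(0, 1.0, size=(20, 2)) for c in centres])

# We only tell the algorithm how many groups to look for, never the breeds.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
group_ids = kmeans.fit_predict(features)

# Each dog gets a group number, but nothing says which breed a group is.
print(sorted(set(group_ids.tolist())))
```

The group numbers are arbitrary: group 0 is not "breed 0", and working out what each group means is still up to you.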
Labelling is expensive work. It costs money to get an expert to label things, and sometimes you might have some labelled data, and a lot of unlabelled data.
For example, you might have the data of 10,000 dogs. Your dog expert might label 1000 dogs, and then get bored and go play with the dogs instead.
In this case, a semi-supervised algorithm makes use of both the labelled data as well as the unlabelled data to train the model to predict dog breed or size.
The simplest form of semi-supervised learning is called self-learning (also commonly known as self-training). In this case, you grab the data from the 1000 labelled dogs and, using the supervised algorithm of your choice, create a model. You then use that model to predict labels for a few of the unlabelled samples (say we choose k=5), add those newly labelled dogs to your training data, and use the enlarged sample of 1005 dogs to re-create your model. You keep going, adding k samples at a time, until all the unlabelled dogs are labelled.
Some algorithms will also tell you the confidence they have in their predictions. In this case you predict all the unlabelled dogs and choose the k samples with the highest confidence to add back into your pool first. This usually results in a more accurate model, because the labels most likely to be wrong are added last.
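The loop described above can be sketched as follows. This is one possible reading of it, not a definitive implementation: the dogs are synthetic, the two breeds are made up, and GaussianNB is simply an assumed choice of classifier that reports a confidence through predict_proba.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

# Synthetic dogs with two features (height, fluffiness) and two made-up breeds.
breed_a = rng.normal([30.0, 3.0], 2.0, size=(100, 2))
breed_b = rng.normal([60.0, 8.0], 2.0, size=(100, 2))
features = np.vstack([breed_a, breed_b])
true_breed = np.array([0] * 100 + [1] * 100)

# Pretend the expert only labelled the first 10 dogs of each breed.
labelled_idx = np.r_[0:10, 100:110]
X_lab, y_lab = features[labelled_idx], true_breed[labelled_idx]
X_unl = np.delete(features, labelled_idx, axis=0)


def self_train(X_lab, y_lab, X_unl, k=5):
    """Label k unlabelled samples per round, most confident first."""
    while len(X_unl) > 0:
        model = GaussianNB().fit(X_lab, y_lab)
        proba = model.predict_proba(X_unl)
        guessed = model.classes_[proba.argmax(axis=1)]
        # Indices of the k predictions the model is most confident about.
        top = np.argsort(proba.max(axis=1))[-k:]
        # Move them from the unlabelled pool into the training data.
        X_lab = np.vstack([X_lab, X_unl[top]])
        y_lab = np.concatenate([y_lab, guessed[top]])
        X_unl = np.delete(X_unl, top, axis=0)
    # Final model, created on everything (expert labels plus predicted labels).
    return GaussianNB().fit(X_lab, y_lab), X_lab, y_lab


model, X_all, y_all = self_train(X_lab, y_lab, X_unl, k=5)
accuracy = (model.predict(features) == true_breed).mean()
print(len(y_all), accuracy)
```

scikit-learn also ships a SelfTrainingClassifier in sklearn.semi_supervised that wraps this pattern, which may be a better starting point than a hand-written loop.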
Another form of semi-supervised learning is called co-training. In this case, you create two classifiers, each based on a different set of features (you may hear these described as views). In our example, the first view might be “height” and “ear size”, and the second view “muzzle length” and “fluffiness”.
The first classifier is created using all 1000 labelled dogs, but only using “height” and “ear size” as inputs. Likewise for the second, but only using “muzzle length” and “fluffiness”.
We then choose the number of samples k to take each time, and run k unlabelled dogs through each of the two models we created. As in self-learning, if the model provides a measure of confidence, we predict all the unlabelled dogs and take the k with the highest confidence. Now take the k newly labelled samples from the first model and add them to the training data for the second model, giving it 1000+k samples, and vice versa – the k newly labelled samples from the second model get added to the pool of data for the first model. Then re-create both models on their newly expanded data. Repeat until everything is labelled.
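Here is a rough sketch of that exchange between the two models. Several details are assumptions made for the example: the dogs and breeds are synthetic, GaussianNB stands in for "the classifier of your choice", unlabelled dogs are marked with -1, and the second model picks from whatever the first model left in the pool so that the two never label the same dog in one round.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

# Synthetic dogs: [height, ear size, muzzle length, fluffiness], two made-up breeds.
breed_a = rng.normal([30.0, 6.0, 5.0, 3.0], 1.0, size=(100, 4))
breed_b = rng.normal([60.0, 12.0, 11.0, 8.0], 1.0, size=(100, 4))
features = np.vstack([breed_a, breed_b])
true_breed = np.array([0] * 100 + [1] * 100)

# -1 marks an unlabelled dog; the expert labelled 10 dogs of each breed.
y_start = np.full(200, -1)
y_start[np.r_[0:10, 100:110]] = true_breed[np.r_[0:10, 100:110]]

views = [[0, 1], [2, 3]]  # view 1: height, ear size; view 2: muzzle, fluffiness


def fit_models(X_views, y, train_idx):
    """Create one classifier per view, each on its own training pool."""
    return [GaussianNB().fit(X_views[m][train_idx[m]], y[train_idx[m]])
            for m in (0, 1)]


def co_train(features, y, views, k=5):
    y = y.copy()
    X_views = [features[:, v] for v in views]
    labelled = np.flatnonzero(y != -1)
    pool = np.flatnonzero(y == -1)
    train_idx = [labelled.copy(), labelled.copy()]
    while len(pool) > 0:
        models = fit_models(X_views, y, train_idx)
        for m in (0, 1):
            if len(pool) == 0:
                break
            proba = models[m].predict_proba(X_views[m][pool])
            top = np.argsort(proba.max(axis=1))[-k:]  # k most confident
            picked = pool[top]
            y[picked] = models[m].classes_[proba.argmax(axis=1)][top]
            # Labels from one model go into the OTHER model's training pool.
            train_idx[1 - m] = np.concatenate([train_idx[1 - m], picked])
            pool = np.setdiff1d(pool, picked)
    return fit_models(X_views, y, train_idx), y


models, y_final = co_train(features, y_start, views, k=5)
accuracy = (y_final == true_breed).mean()
print(accuracy)
```

At the end each model has been trained on the expert's labels plus the labels proposed by its partner, and y_final holds a label for every dog.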
Co-training is less sensitive to mistakes than self-learning. However, the models that it creates may not always be effective, as a model created with more features will often do better than one with fewer – and we are creating each model with only half of the features. When using co-training we are assuming that each view/set of features alone is good enough to make a classifier, and also that the two sets of features are independent of each other.