Learning algorithms are the seeds, data is the soil, and the learned programs are the grown plants. The machine learning expert is like a farmer, sowing the seeds, irrigating and fertilizing the soil, and keeping an eye on the health of the crop but otherwise staying out of the way.
In traditional computer science, an algorithm takes data as input and produces data as output. Machine learning algorithms are different: they are called learners, and they take data as input and output other algorithms. The algorithms produced by learners come in several types, but the most common ones are called classifiers. They are used to assign a class, or label, to an object described by certain numeric or categorical features.
You are surrounded by machine learning classifiers. They are there when Amazon suggests a new product, or Netflix a new TV series. They are there when Google translates a website for you, or when your word processor checks the grammar of your document. They are there when you ask Siri to search for your favorite restaurant, when a surveillance camera detects anomalous movements, and even when a doctor diagnoses a disease.
Everything you do, everything you use, everything you like, machine learning is there. And it is shaping a new world in ways that we can hardly understand. Nowadays, Amazon's classifiers determine the success of books and goods far more than the quality of the books and goods themselves. Your webpage is visible on the web only if Google puts it near the top of its ranking, which is generated with the help of complex classifiers. Machine learning is such a powerful technology that it is not wise to treat it as a black box. We cannot control what we do not understand. If we want to understand the world we live in, we must study and understand machine learning.
There are a few basic concepts, tricks, and lessons I have learned while working with machine learning, and I want to share them with you. Perhaps they can be a first step for newcomers toward understanding the “black magic” behind machine learning.
Learner vs Classifier
Machine learning works in two steps: learning and prediction. During the first step, you use a learner to generate a classifier. A learner takes as input a training set of examples and produces a classifier. An example is a pair (x, y), where x is an object represented as a set of features and y is the class of the object. A classifier is an algorithm that takes as input an object represented by the same set of features and outputs the class of the object. A learner is a meta-model you can use to create a classifier in many situations. Imagine a learner as your skilled experience: you can apply it in many contexts to create something new. A classifier is a specific predictive model learned on specific data by means of a learner. Imagine it as a specific product you created with your skilled experience.
Remember: do not confuse learners and classifiers.
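The learner/classifier distinction can be sketched in a few lines of plain Python. Here the learner is a hypothetical one-nearest-neighbor function, and the training set is invented toy data for illustration:

```python
# A minimal sketch of the distinction: the learner is a function that
# inputs a training set of (x, y) examples and outputs a classifier;
# the classifier is a function that inputs a new object x and outputs
# a predicted class y.

def nearest_neighbor_learner(training_set):
    """The learner: takes (features, label) pairs, returns a classifier."""
    def classifier(x):
        # The classifier: predicts the label of the closest training example.
        def distance(example):
            features, _ = example
            return sum((a - b) ** 2 for a, b in zip(features, x))
        _, label = min(training_set, key=distance)
        return label
    return classifier

# Toy training set: each example is (features, class).
training_set = [
    ((1.0, 1.0), "cat"),
    ((1.2, 0.9), "cat"),
    ((5.0, 5.0), "dog"),
    ((4.8, 5.2), "dog"),
]

classifier = nearest_neighbor_learner(training_set)
print(classifier((1.1, 1.0)))  # -> cat
print(classifier((5.1, 4.9)))  # -> dog
```

The learner (`nearest_neighbor_learner`) can be reused on any training set; each classifier it returns is tied to the specific data it was built from.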
A good classifier must be good at predicting the class of objects that were not used by the learner during the training process. However, there is a ruthless enemy that can keep you from achieving this goal: overfitting.
Overfitting is like the dragon faced by the Germanic hero Sigurd: it is hard to defeat, but if you succeed you acquire supernatural powers. Overfitting means that the classifier generated by the learner cannot generalize to unseen objects, i.e., it assigns objects not used in the learning phase to the wrong class. This happens because the learner adapted too much to the training data and is not able to generalize. The problem can be caused by several factors: too little data, too many features, errors in the training data, or all three. There are plenty of techniques to verify whether a classifier is overfitted. An overfitted classifier is useless, but a non-overfitted classifier gives you supernatural powers.
Remember: overfitting is your enemy.
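One standard way to detect overfitting is to evaluate the classifier on examples held out from training: a classifier that is near-perfect on the training set but much worse on the held-out set has adapted too much to the training data. A minimal sketch in plain Python, using a deliberately extreme "memorizing" learner and invented toy data:

```python
# An extreme learner that memorizes every training example: perfect on
# the training set, useless on anything unseen. Comparing accuracy on
# the training set vs. a held-out test set exposes the overfitting.

def memorizing_learner(training_set):
    """Returns a classifier that looks up exact matches in a table."""
    table = {features: label for features, label in training_set}
    def classifier(x):
        # Exact matches are recalled; anything unseen gets a default class.
        return table.get(x, "unknown")
    return classifier

def accuracy(classifier, examples):
    """Fraction of examples whose class is predicted correctly."""
    correct = sum(1 for x, y in examples if classifier(x) == y)
    return correct / len(examples)

training_set = [((1, 1), "cat"), ((5, 5), "dog"), ((1, 2), "cat")]
test_set = [((1.1, 1.0), "cat"), ((4.9, 5.1), "dog")]

clf = memorizing_learner(training_set)
print(accuracy(clf, training_set))  # 1.0 -- perfect on what it memorized
print(accuracy(clf, test_set))      # 0.0 -- no generalization at all
```

Real learners are rarely this extreme, but the diagnostic is the same: a large gap between training accuracy and held-out accuracy signals overfitting.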
Give feature engineering the same attention you give to your partner
The quality of your classifier is mainly the quality of your features. This is why you should dedicate most of your time to feature engineering, i.e., the process of creating and selecting the right features to represent the objects you would like your classifier to classify. Choosing the right features is the most powerful sword you can wield against the overfitting dragon.
Feature engineering is also the most creative part of a machine learning process. It is domain-dependent and relies on the intuition and experience of the data scientist performing the machine learning task. In many cases, the popular saying “less is more” perfectly applies to feature engineering. You may think “the more features I add, the more the learner can learn”. Wrong. If you add too many features, the learner may generate an overfitted classifier. To obtain a good classifier, the learner needs the right number of good features. How to choose them? Welcome to the black art of machine learning.
Remember: good features, good classifiers.
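As a sketch of what feature engineering looks like in practice, here is a plain Python function that turns a raw object (a hypothetical email text) into the numeric and categorical features a learner actually sees. The feature names and rules below are illustrative assumptions, not a recommended feature set:

```python
# Feature engineering: mapping a raw object to a handful of features
# chosen with domain intuition (here, an invented spam-detection setting).

def extract_features(email_text):
    """Represent a raw email text as numeric and categorical features."""
    words = email_text.lower().split()
    return {
        "num_words": len(words),                    # numeric feature
        "num_exclamations": email_text.count("!"),  # numeric feature
        # categorical (boolean) feature, encoding a domain hunch
        "mentions_money": "free" in words or "$" in email_text,
    }

features = extract_features("FREE offer!!! Claim your $100 now")
print(features)
# {'num_words': 6, 'num_exclamations': 3, 'mentions_money': True}
```

Notice that the raw text never reaches the learner: the learner only ever sees these three features, which is exactly why choosing them well matters so much.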
Open the Black Box
OK, you did good feature engineering, you generated the classifier (well, the learner did it for you), and you verified that the classifier is good. Well, this is not necessarily the end of the story.
The classifier represents the function between the data and the class. It embeds a complex reasoning. Although it seems to be a black box, in many cases you can open it and uncover a lot of knowledge. For certain classifiers, such as neural networks, this is not possible: they are too complex and hardly interpretable by humans. In contrast, other learners produce classifiers that can be easily read and interpreted by humans (e.g., decision trees).
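As a sketch of how interpretable a decision tree can be, here is a hand-built toy tree in plain Python whose entire logic can be printed as human-readable if/then rules. The feature names and thresholds are invented for illustration:

```python
# A toy decision tree: (test, subtree_if_true, subtree_if_false);
# a plain string is a leaf holding the predicted class.
tree = ("income > 50000",
        ("age > 30", "approve", "review"),
        "reject")

def rules(node, conditions=()):
    """Flatten the tree into one human-readable rule per leaf."""
    if isinstance(node, str):  # leaf: emit the full path as a rule
        return ["IF " + " AND ".join(conditions) + " THEN " + node]
    test, yes, no = node
    return (rules(yes, conditions + (test,))
            + rules(no, conditions + ("NOT (" + test + ")",)))

for rule in rules(tree):
    print(rule)
# IF income > 50000 AND age > 30 THEN approve
# IF income > 50000 AND NOT (age > 30) THEN review
# IF NOT (income > 50000) THEN reject
```

The whole classifier fits in three rules a human can read, question, and correct, which is exactly what a deep neural network does not offer.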
Opening the box and looking at the reasoning of the classifier when it makes predictions is highly valuable and can lead to interesting insights into the nature of the phenomenon represented. Although this is a phase generally neglected by many data scientists, I believe it is going to be the most important one.
The General Data Protection Regulation (GDPR) of the European Union establishes a right to explanation for all individuals, entitling them to “meaningful explanations of the logic involved” when automated profiling and predictions take place. In other words, starting from 2018, classifiers should be interpretable by humans. Therefore, when performing your machine learning task, consider that in some situations a less accurate but more interpretable classifier (e.g., a decision tree) may be preferable to a highly accurate black-box classifier (e.g., a deep learning classifier).
Remember: try to open the black box and explain your classifier.
There are many other concepts and tricks you should know about machine learning. Let me suggest this beautiful paper by Pedro Domingos.
Like every art, machine learning requires practice, practice, practice. You cannot defeat the dragon if you do not train hard.
Post by Luca Pappalardo