What is statistical learning?
The simplest example: spam filters
If you ask me to classify an e-mail message as spam or as a legitimate one, I can pretty much do it without errors (especially if I'm the recipient). Machine learning attemps to emulate the human performance in this task by constructing automated spam filters, that by looking at the sender and the text of a mail are able to classify it as spam or legitimate.
There are many techniques and models that can be applied to build such a filter, but they have some commonalities.
They all try to learn a mathematical function: a pure function without side effects or internal state. And, in this branch of the field called supervised learning, they start from existing data to construct a model and its parameters.
Let's continue our example: if you ask me to automate the task of classifying mail, I wouldn't be able to write down a few lines of code that do the job for me, like I would with a database query or for calculating taxes (where a documented algorithm already exists.)
So what can I do? I can apply supervised learning techniques to a corpus of mail messages that a human has classified, and try to learn a general rule. This rule is our function:
boolean isSpam($sender, $subject, $text)
and can in theory be expanded with whatever argument we happen to know about the mail messages of our training set (the set containing already classified instances). Our hope is to analyze as many instances like:
email@example.com A Nigerian prince needs your help ... SPAM firstname.lastname@example.org Database courses offer ... SPAM email@example.com [xp-it] Agile and respect ... NOT SPAM
This is a (binary) classification problem, because the output we are trying to learn is a discrete set of (2) labels. Regression problems try to learn a function whose value is a real number instead.
Who gives us all the already classified data to learn from? After all, you asked me to build a spam filter because we didn't know how to classify a large number of mails automatically.
This is usually the user's job, with a little help of infrastructure:
What is our job is to choose a method (like a Bayes classifier) with its related model (for example representing all mails as the set of the words they contain.) We then use a learning algorithm (there may be more than one for each model) to train the model and find out its parameters. In the case of the Naive Bayes Classifier, the model is multiplicative:
SPAM = Prior(SPAM) * P(Word1|SPAM) * P(Word2|SPAM) ... * P(WordN|SPAM) NOT_SPAM = Prior(NOT_SPAM) * P(Word1|NOT_SPAM) * P(Word2|NOT_SPAM) ... * P(WordN|NOT_SPAM)
where the highest computed value wins. We include in the calculation all the words of the new mail to classify, but in our corpus we have to cover as many words as possible: our parameters range from P(Nigerian|SPAM) to P(iPad|SPAM) to P(Agile|NOT_SPAM). Stemming and regularization can help in reducing the number of words needed, but more data don't hurt.
It is possible that our model is too detailed to be generalized: we should keep in mind that our goal is to classify new mails, not just the one from the training set (by design we already know the class they belong to). For example, we may only have a few instances containing the word 'Hobbit', all of them unfortunately classified as spam. Therefore, this word will weigh very much in our model to pull mails towards the spam label.
What we can do to try to avoid building a model overfitting the data is separate the available classified instances in:
- a training set, used for learning the parameters of the model.
- a validation set, used for evaluating the performance of the model.
With a separate validation set, you can repeatedly train more complex models and check their errors on this set; for example, selecting only the 10 most frequent words from the mails, then the most frequent 20, and so on. The error on the training set will always go down with complexity, but it will reach a minimum on the validation set. This minimum is the sign that you have to early stop.
Statistical supervised learning is always composed of model choice, training set composition, learning and validation. Choosing the model also means to define how it works (there are usually codified mathematical model already existing), and to select which parameters we want to learn and what metric to use to measure the error of a copy of the model.
Training sets are the most important thing to look out for, if you ask Peter Norvig. Learning is again, a codified algorithm, and validation the final step necessary to ensure you're not producing garbage results.