1.1 Learning from data
The simplest use case for a model trained from data is when a signal $x$ is accessible, for instance the picture of a license plate, from which one wants to predict a quantity $y$, such as the string of characters written on the plate.
In many real-world situations where $x$ is a high-dimensional signal captured in an uncontrolled environment, it is too complicated to come up with an analytical recipe that relates $x$ and $y$.
What one can do is collect a large training set $\mathscr{D}$ of pairs $(x_n, y_n)$, and devise a parametric model $f$: a piece of computer code that incorporates trainable parameters $w$ which modulate its behavior, and such that, with the proper values $w^{\ast}$, it is a good predictor. “Good” here means that if an $x$ is given to this piece of code, the value $\hat{y}=f(x;w^{\ast})$ it computes is a good estimate of the $y$ that would have been associated with $x$ in the training set, had it been there.
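As a minimal sketch of this setup, one could take $f$ to be a linear model whose parameters $w$ consist of a weight vector and a bias, applied to a small synthetic training set. The model, the data, and all names below are purely illustrative assumptions, not a construction from the text.

```python
import numpy as np

# A toy parametric model: f(x; w) = <w_vec, x> + b, with parameters w = (w_vec, b).
def f(x, w):
    w_vec, b = w
    return w_vec @ x + b

# A synthetic training set D of N pairs (x_n, y_n), stored as arrays.
rng = np.random.default_rng(0)
N, d = 100, 3
X = rng.normal(size=(N, d))                           # the signals x_1, ..., x_N
true_w, true_b = np.array([2.0, -1.0, 0.5]), 0.3
Y = X @ true_w + true_b + 0.01 * rng.normal(size=N)   # the quantities y_1, ..., y_N
```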
This notion of goodness is usually formalized with a loss $\mathscr{L}(w)$ which is small when $f(\cdot;w)$ is good on $\mathscr{D}$. Training the model then consists of computing a value $w^{\ast}$ that minimizes $\mathscr{L}(w^{\ast})$, that is, $w^{\ast} \in \operatorname{argmin}_w \mathscr{L}(w)$.
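Continuing the illustrative sketch above, a common choice of loss is the mean squared error over $\mathscr{D}$, minimized here with plain gradient descent. This is only one possible instantiation of the general recipe, under the assumed linear model.

```python
# Mean squared error over D: L(w) = (1/N) * sum_n (f(x_n; w) - y_n)^2.
def loss(w, X, Y):
    w_vec, b = w
    residuals = X @ w_vec + b - Y
    return np.mean(residuals ** 2)

# Minimize L(w) with plain gradient descent on the two components of w.
def train(X, Y, lr=0.1, steps=500):
    w_vec, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        residuals = X @ w_vec + b - Y
        grad_w = 2.0 * X.T @ residuals / len(Y)   # dL/dw_vec
        grad_b = 2.0 * residuals.mean()           # dL/db
        w_vec -= lr * grad_w
        b -= lr * grad_b
    return w_vec, b

w_star = train(X, Y)
print(loss(w_star, X, Y))   # small value: f(.; w_star) is a good predictor on D
```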
Most of the content of this book is about the definition of $f$, which, in realistic scenarios, is a complex combination of predefined sub-modules.
The trainable parameters that compose $w$ are often referred to as weights, by analogy with the synaptic weights of biological neural networks. In addition to these parameters, models usually depend on meta-parameters, which are set according to domain prior knowledge, best practices, or resource constraints. They may also be optimized in some way, but with techniques different from those used to optimize $w$.
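To make the distinction concrete in the toy sketch above: the weight vector and bias are the trainable parameters $w$, updated by the minimization of $\mathscr{L}(w)$, while quantities such as the learning rate and the number of gradient steps are meta-parameters, fixed before training. The values below are illustrative assumptions.

```python
# Meta-parameters: chosen from prior knowledge, best practices, or resource
# constraints; they are not updated by the minimization of L(w).
meta = {"learning_rate": 0.1, "steps": 500}

# Trainable parameters w: produced by the training procedure itself.
w_star = train(X, Y, lr=meta["learning_rate"], steps=meta["steps"])
```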