## Deep Learning and Paradigms

By Sergi Andreu

// This post is the second part of the “Opening the black box of Deep Learning” post

#### Deep Learning

Now that we have some intuition about the data, it’s time to focus on how to approximate the functions f that would fit that data.

When doing *supervised learning*, we need three basic ingredients:

– A *hypothesis space*, $\mathcal{H}$, consisting of a set of trial functions. Every function in the hypothesis space is determined by some parameters $\theta$; that is, every $f \in \mathcal{H}$ is of the form $f(\cdot; \theta)$. We therefore want to find the best parameters $\theta^*$.

– A *loss function*, which tells us how close we are to our desired function.

– An *optimization algorithm*, which allows us to update our parameters at each step.

Our algorithm **initializes** the process by choosing a function $f(\cdot; \theta) \in \mathcal{H}$. In **Deep Learning**, this is done by **randomly choosing some parameters** $\theta_0$ and initializing the function as $f(\cdot; \theta_0)$.

In **Deep Learning**, the **loss function** is given by the **empirical risk**. That is, we choose a subset of the dataset, $\{x_j, y_j\}_{j=1}^{s}$, and compute the loss as $\frac{1}{s} \sum_{j=1}^{s} l_j(\theta)$, where $l_j$ could simply be given by $l_j(\theta) = \| f(x_j; \theta) - y_j \|_2^2$.
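As a minimal sketch of the initialization and of the empirical risk, the snippet below uses a toy hypothesis space of affine functions on scalars. The names `f` and `empirical_risk`, the toy dataset and the batch size are assumptions made for illustration only; they are not taken from any particular library.

```python
import random

def f(x, theta):
    """Hypothetical trial function f(x; theta): an affine map of a scalar input."""
    w, b = theta
    return w * x + b

def empirical_risk(theta, batch):
    """Mean of the losses l_j(theta) = (f(x_j; theta) - y_j)^2 over the batch."""
    return sum((f(x, theta) - y) ** 2 for x, y in batch) / len(batch)

# Toy dataset generated by y = 2x + 1 (an assumption, only for illustration).
dataset = [(x, 2.0 * x + 1.0) for x in [0.0, 1.0, 2.0, 3.0]]

# Random initialization theta_0, followed by a random subset of size s = 2.
rng = random.Random(0)
theta_0 = (rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0))
batch = rng.sample(dataset, 2)

risk_at_init = empirical_risk(theta_0, batch)
# At the parameters that generated the data, every l_j vanishes.
risk_at_truth = empirical_risk((2.0, 1.0), dataset)
```

Here the squared Euclidean norm reduces to a squared difference because the labels are scalars.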

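The **optimization algorithms** used in Deep Learning are mainly based on **Gradient Descent**, such as

$$\theta_{k+1} = \theta_k - \epsilon \, \frac{1}{s} \sum_{j=1}^{s} \nabla_\theta l_j(\theta_k),$$

where the subindex denotes the parameters at each iteration and $\epsilon > 0$ is the step size (learning rate).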

We expect our algorithm to converge; that is, as $k \to \infty$, we want $\theta_k$ to be close to $\theta^*$, where the notion of “closeness” is given by the chosen loss function.
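The iteration can be sketched in a few lines for a case where convergence is easy to check. This is only an illustration under strong assumptions: an affine model on scalars, a toy dataset generated by $y = 2x + 1$, the whole dataset used at every step instead of a random subset, and a hand-picked step size. All function names are hypothetical.

```python
def f(x, theta):
    """Hypothetical trial function f(x; theta) = w * x + b."""
    w, b = theta
    return w * x + b

def risk(theta, data):
    """Empirical risk with squared-error losses."""
    return sum((f(x, theta) - y) ** 2 for x, y in data) / len(data)

def risk_gradient(theta, data):
    """Gradient of the empirical risk with respect to (w, b)."""
    w, b = theta
    s = len(data)
    grad_w = sum(2.0 * (w * x + b - y) * x for x, y in data) / s
    grad_b = sum(2.0 * (w * x + b - y) for x, y in data) / s
    return grad_w, grad_b

def gradient_descent(theta_0, data, step_size=0.05, iterations=2000):
    """Repeat theta_{k+1} = theta_k - step_size * gradient(theta_k)."""
    theta = theta_0
    for _ in range(iterations):
        grad_w, grad_b = risk_gradient(theta, data)
        theta = (theta[0] - step_size * grad_w, theta[1] - step_size * grad_b)
    return theta

data = [(x, 2.0 * x + 1.0) for x in [0.0, 1.0, 2.0, 3.0]]
theta_0 = (0.0, 0.0)
theta_k = gradient_descent(theta_0, data)  # approaches theta* = (2, 1)
```

This toy risk is convex, so the iterates approach $\theta^* = (2, 1)$ for a small enough step size. For Neural Networks the risk is generally not convex, and no such guarantee is available.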

#### Paradigms of Deep Learning

There are many questions about **Deep Learning** that are not yet solved. Three of these paradigms, which would be convenient to treat mathematically, are:

– *Optimization*: How do we go from the initial function, $F_0$, to a better function $\tilde{F}$? This has to do with our optimization algorithm, but it also depends on how we choose the initialization, $F_0$.

– *Approximation*: Is the **ideal solution** $F^*$ in our hypothesis space $\mathcal{H}$? Is $\mathcal{H}$ at least dense in the space of feasible ideal solutions $F^*$? This is studied by characterizing the function spaces $\mathcal{H}$.

– *Generalization*: How does our function $\tilde{F}$ generalise to previously unseen data? This is studied through the **stochastic properties** of the **whole distribution**, given by the **generalised dataset**.

The paradigm of *approximation* is somehow resolved by the so-called **Universal Approximation Theorems** (see, for example, the results by Cybenko [11]). These results state that the functions generated by *Neural Networks* are dense in the space of continuous functions on a compact domain. However, in principle it is not clear what we gain by increasing the number of layers (using *deeper Neural Networks*), which seems to give better results in practice.
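For reference, one common way of stating Cybenko's result is sketched below. The symbols $N$, $\alpha_j$, $w_j$, $b_j$, $g$ and $I_n$ are introduced here for illustration and are not notation used elsewhere in this post. Let $\sigma$ be a continuous sigmoidal function and $I_n = [0,1]^n$ the unit cube. Then for every continuous function $g$ on $I_n$ and every $\varepsilon > 0$ there exist $N$, $\alpha_j \in \mathbb{R}$, $w_j \in \mathbb{R}^n$ and $b_j \in \mathbb{R}$ such that

$$\left| g(x) - \sum_{j=1}^{N} \alpha_j \, \sigma\left(w_j^{\top} x + b_j\right) \right| < \varepsilon \quad \text{for all } x \in I_n.$$

Note that this statement concerns networks with a single hidden layer and gives no bound on how large $N$ has to be, so by itself it says nothing about the benefit of depth.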

Regarding *optimization*, although algorithms based on Gradient Descent are widely used, it is not clear whether the solution gets stuck in local minima, whether convergence is guaranteed, how to initialize the functions…

When dealing with *generalization*, everything gets more complicated, because we have to derive some properties of the **generalised dataset**, to which we do not have access. It is probably the least understood of the three paradigms.

If **optimization** is completely solved, we would know that the final solution given by the algorithm, $\tilde{F}$, is the best possible solution in the hypothesis space, $\hat{F}$; and if **approximation** is also solved, we would end up getting the ideal solution $F^*$.
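One way to make this chain of reasoning explicit is the triangle inequality. It assumes that the distance between functions is measured by some norm $\| \cdot \|$, which the post itself does not fix:

$$\| \tilde{F} - F^* \| \le \underbrace{\| \tilde{F} - \hat{F} \|}_{\text{optimization}} + \underbrace{\| \hat{F} - F^* \|}_{\text{approximation}}.$$

The first term vanishes when the algorithm returns the best function in $\mathcal{H}$, and the second vanishes when $F^*$ itself belongs to $\mathcal{H}$.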

In Deep Learning, the hypothesis space is generated by a **Neural Network**, and so it is quite difficult to characterize mathematically. That is, we do not know much about the shape of $\mathcal{H}$ (although the **Universal Approximation Theorems** give valuable insight here), nor how to “move” in the space $\mathcal{H}$ by tuning $\theta$.

#### Hypothesis spaces

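We look for the function in the hypothesis space that **minimizes the distance** to the **target function**, which connects the datapoints $x_i$ with their corresponding labels $y_i$.

In **Deep Neural Networks**, the hypothesis space $\mathcal{H}$ is generated by the composition of simple functions $f$ such that

$$F(x; \theta) = f^{L}_{\theta_L} \circ f^{L-1}_{\theta_{L-1}} \circ \cdots \circ f^{1}_{\theta_1}(x),$$

where $L$ is the number of layers and $\theta = (\theta_1, \dots, \theta_L)$.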
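The idea of building a function by composing simple parametrized pieces can be sketched as follows. The helper `compose` and the three scalar functions are hypothetical and chosen only so that the result is easy to check by hand.

```python
from functools import reduce

def compose(*layers):
    """Return the composition that applies layers[0] first and layers[-1] last."""
    def composed(x):
        return reduce(lambda value, layer: layer(value), layers, x)
    return composed

# Three hypothetical "simple functions" on scalars.
def f1(x):
    return 2.0 * x      # parameter theta_1 = 2.0

def f2(x):
    return x + 1.0      # parameter theta_2 = 1.0

def f3(x):
    return x * x        # no parameter

F = compose(f1, f2, f3)
value = F(3.0)  # f3(f2(f1(3.0))) = (2 * 3 + 1) ** 2 = 49.0
```

Even with such simple pieces, the dependence of `F` on the individual parameters becomes nonlinear once the functions are composed.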

#### Multilayer perceptrons

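In **multilayer perceptrons**, the functions $f^k_{\theta_k}$ are constructed as

$$f^k_{\theta_k}(x) = \sigma(W_k x + b_k), \qquad \theta_k = (W_k, b_k),$$

where $W_k$ is a matrix of weights, $b_k$ is a bias vector and $\sigma$ is the **activation function**.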
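A minimal sketch of such a network in plain Python is given below, assuming each layer has the form $\sigma(W x + b)$ with $\sigma = \tanh$ applied componentwise. The function names and the numerical weights are made up for illustration.

```python
import math

def affine(W, b, x):
    """Compute W x + b for a matrix W (list of rows) and vectors b and x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

def perceptron_layer(W, b, x, sigma=math.tanh):
    """One layer sigma(W x + b), with sigma applied to each component."""
    return [sigma(z) for z in affine(W, b, x)]

def multilayer_perceptron(params, x):
    """Apply the layers in order; params is a list of (W, b) pairs."""
    for W, b in params:
        x = perceptron_layer(W, b, x)
    return x

# Hypothetical parameters: a layer from R^2 to R^3, then a layer from R^3 to R^1.
params = [
    ([[1.0, -1.0], [0.5, 0.5], [0.0, 2.0]], [0.0, 0.1, -0.1]),
    ([[1.0, 0.0, -1.0]], [0.0]),
]
output = multilayer_perceptron(params, [0.0, 0.0])
```

The shapes of consecutive weight matrices have to match, so the dimension of the data can change from one layer to the next.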

The **complexity** of the **hypothesis spaces in Deep Learning** is generated by the **composition of functions**. Unfortunately, the tools regarding the understanding of the discrete composition of functions are quite limited.

We now turn to **Residual Neural Networks**.

#### Residual Neural Networks

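In a **Residual Neural Network** (often referred to as a **ResNet**), the functions $f^{k}_{\theta_k}(\cdot)$ are constructed as

$$f^{k}_{\theta_k}(x) = x + \sigma(W_k x + b_k),$$

where $W_k$ is a weight matrix, $b_k$ is a bias vector and $\sigma$ is the activation function.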
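As a sketch, assume the commonly used residual form $x + \sigma(W x + b)$, which may differ in details from the formula intended in the post. A single residual layer can then be written as below, with hypothetical names and values. Because the input is added back, $W$ has to be square.

```python
import math

def residual_layer(W, b, x, sigma=math.tanh):
    """One residual layer x + sigma(W x + b); W must be a square matrix."""
    update = [sigma(sum(w * x_j for w, x_j in zip(row, x)) + b_i)
              for row, b_i in zip(W, b)]
    return [x_i + u_i for x_i, u_i in zip(x, update)]

x = [0.5, -1.0]
zero_W = [[0.0, 0.0], [0.0, 0.0]]
zero_b = [0.0, 0.0]

# With zero weights and biases, tanh(0) = 0 and the layer is the identity map.
unchanged = residual_layer(zero_W, zero_b, x)
```

Under this assumed form, each layer is a perturbation of the identity, so the dimension of the data stays the same across layers.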

#### Visualizing Perceptrons

The **Neural Networks** then represent compositions of functions $f$, each of which consists of an affine transformation followed by a nonlinearity.

If we use the **activation function** $\sigma(\cdot) = \tanh(\cdot)$, then each function $f$ (and so each layer of the Neural Network) first transforms the original space affinely (rotating, stretching and translating it), and then “squeezes” it.

Example of the transformation of a regular grid by a perceptron
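A transformation of this kind can be reproduced numerically with the sketch below, which maps the points of a regular grid in the plane through one $\tanh$ perceptron. The rotation angle, the scaling and the shift are arbitrary choices made for illustration, and the function names are hypothetical.

```python
import math

def perceptron_2d(W, b, point):
    """Apply tanh(W p + b) to a point p = (x, y) of the plane."""
    x, y = point
    return (math.tanh(W[0][0] * x + W[0][1] * y + b[0]),
            math.tanh(W[1][0] * x + W[1][1] * y + b[1]))

def regular_grid(n, low=-2.0, high=2.0):
    """Return the n * n points of a regular grid on [low, high]^2."""
    step = (high - low) / (n - 1)
    return [(low + i * step, low + j * step) for i in range(n) for j in range(n)]

# Hypothetical parameters: rotate by 45 degrees, scale by 1.5, then shift.
angle = math.pi / 4
W = [[1.5 * math.cos(angle), -1.5 * math.sin(angle)],
     [1.5 * math.sin(angle), 1.5 * math.cos(angle)]]
b = [0.3, -0.2]

grid = regular_grid(5)
transformed = [perceptron_2d(W, b, p) for p in grid]
```

Every transformed point lies inside the square $(-1, 1)^2$, which is the “squeezing” effect of $\tanh$. Plotting `grid` and `transformed` side by side with any plotting tool gives a picture of the deformation.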

// This post continues on post: **Perceptrons, Neural Networks and Dynamical Systems**