Introduction to Artificial Neural Networks for Regression

Introduction to Artificial Neural Networks

Part 1: Feedforward Networks for Regression

Elliott Forney, 2018

Thank you for finding your way to my tutorial series on Artificial Neural Networks (ANNs). These guides are intended to be an introduction to Machine Learning (ML) applications of neural nets for people who have at least a moderate level of experience with Python, NumPy, data analysis, linear algebra and some calculus. It is my hope that these tutorials will help to demystify the inner workings of neural networks and give the reader a strong starting point for designing and engineering neural nets to solve your own unique problems.


This tutorial is written in a Jupyter Notebook in Python3 so that you can actually run the experiments as you read! I will also assume that you have some experience with array-based programming in NumPy and a basic knowledge of linear algebra and calculus. If you don't have prior experience with Python and NumPy, it is probably a good idea to brush up on those tools first. If you don't have a strong background in mathematics, you can probably continue and skim over some of the details of the derivations.

In [1]:
import numpy as np

import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
matplotlib.rcParams['figure.figsize'] = (10.0,6.0)


In machine learning and data science we frequently think in terms of data and models. If we know the model that generated a given data set, then things are generally made quite easy. It is generally the case, however, that we have some dataset and wish to derive a model that could have potentially generated it. If we can find a good model, then we can make predictions about new data once we encounter it.

Another useful way to think about models is in terms of inputs and outputs. For a given observation (the model inputs) the model generates some type of predictions (the model outputs). Statisticians like to call model inputs "predictors" and model output "response variables" but this always seems less descriptive to me.

Regression is the act of finding an approximate model that maps input variables to outputs variables.

Linear Least-Squares Regression

Linear Least-Squares Regression, often simply called Linear Regression, is straight forward and yet incredibly powerful tool for performing regression. Linear Regression assumes that out model outputs can be described as a linear combination of our model inputs. Suppose that a single observation to be fed into our model is a row vector \begin{equation} \textbf{x} = [x_1, x_2, \ldots, x_F] \end{equation} where $F$ is the number of input dimensions to out model, also called the number of features. The the predictions of a linear regression model can be described as \begin{equation} y_j = \sum\limits_{i=0}^{F} x_i \cdot w_{i,j} + b \end{equation} where $b$ is a constant bias value, $w_{i,j}$ is the model weight associated with the $i$'th input dimensions and the $j$'th output dimensions and $y_j$ is the predicted value along the $j$'th output dimension.

Linear regression can also conveniently be written entirely in matrix notation. Let $\mathbf{X} \in \mathbb{R}^{N \times F}$ be a matrix where the columns represent input dimensions and the rows represent observations and $\mathbf{W} \in \mathbb{R}^{F \times K}$ be a matrix of weights. A linear regression model can then be written as \begin{equation} \mathbf{Y} = \mathbf{\tilde{X}} \mathbf{W} \end{equation} where where $K$ is the number of output dimensions and where the tilde above $\mathbf{X}$ denotes that a column of ones has been added in order to incorporates our bias terms and where $\mathbf{Y} \in \mathbb{R}^{N \times K}$ is our matrix of predictions.

A linear regression model can also be visualized as network graph, which illustrates how information flows through the model.


Parameter estimation for Linear Regression

Now that we have established how information flows through an LR model, we need to determine the values of $\mathbf{W}$ that allow the model to actually fit a given dataset.

First need a target dataset $\mathbf{G} \in \mathbb{R}^{N \times K}$.

Then we wish to minimize squared error \begin{align} E(\mathbf{W}) = (\mathbf{Y} - \mathbf{G})^2 = (\mathbf{\tilde{X}}\mathbf{W} - \mathbf{G})^2 \end{align}

From calculus, recall that the derivative of a continuous function must be zero at it's minimum. The gradient of our error function is the matrix of it's derivatives, \begin{align} \nabla E(\mathbf{W}) = & 2 \mathbf{\tilde{X}}^T (\mathbf{\tilde{X}}\mathbf{W} - \mathbf{G}) \\ = & 2 \mathbf{\tilde{X}}^T \mathbf{\tilde{X}}\mathbf{W} - 2 \mathbf{\tilde{X}}^T \mathbf{G}. \\ \end{align}

We can then set the gradient to zero in order to find the local minimum \begin{align} & \nabla E(\mathbf{W}) = 0 \\ \Rightarrow & 2 \mathbf{\tilde{X}}^T \mathbf{\tilde{X}}\mathbf{W} = 2 \mathbf{\tilde{X}}^T \mathbf{G} \\ \Rightarrow & \mathbf{W} = (\mathbf{\tilde{X}}^T \mathbf{\tilde{X}})^{-1} \mathbf{\tilde{X}}^T \mathbf{G} \\ \end{align} As it turns out, linear regression is convex, meaning that there is only one minimum on the error surface. We can verify this by finding the second-order gradient, left as an exercise for the reader

Linear regression example

Let's implement linear regression in python...

In [2]:
def bias(v):
    """Add a column of ones for the bias term.
    ns = v.shape[0]
    return np.hstack((v, np.ones(ns).reshape((ns,-1))))

# example
a = np.arange(5)[:,None]
array([[0., 1.],
       [1., 1.],
       [2., 1.],
       [3., 1.],
       [4., 1.]])
In [3]:
class LinearRegression(object):
    """Linear least squares regression
    def __init__(self, x, g, **kwargs):
        self.train(x, g, **kwargs)
    def train(self, x, g):
        # analytical solution, fails for underdetermined problems
        x1 = bias(x)
        self.w = np.linalg.solve(x1.T @ x1, x1.T @ g)

    def predict(self, x):
        return x @ self.w[:-1] + self.w[-1]

Let's consider a simple example of a linear function with normally distributed noise...

In [4]:
ns = 100
x = np.linspace(0, 1, ns)[:,None]
x -= x.mean(axis=0)
x /= x.std(axis=0)

g = 20.0*x + np.random.normal(size=(ns,1))
g -= g.mean(axis=0)
g /= g.std(axis=0)
In [5]:
lm = LinearRegression(x, g)
y = lm.predict(x)
In [6]:
plt.scatter(x, g)
plt.plot(x, y, color="red");

Regression for nonlinear problems

As it's name suggests, linear regression is purely linear in it's inputs, meaning that it is not directly able to model nonlinear curves. Consider the nonlinear example below.

In [7]:
# a quadratic curve
g_curved = 10*x**2 + np.random.normal(size=(ns,1))
g_curved -= g_curved.mean(axis=0)
g_curved /= g_curved.std(axis=0)

lm_curved = LinearRegression(x, g_curved)
y_curved = lm_curved.predict(x)

plt.scatter(x, g_curved);
plt.plot(x, y_curved, color="red");

There are, however, many tricks that can be used to model nonlinear problems using the linear least squares approach. For example, we can transform our inputs so that the model is nonlinear over the original inputs. Below, we combine both $X$ and $X**2$ so that our linear regression model can fit our quadratic curve. The approach can be further extended using the kernel trick, known as kernel ridge regression.

In [8]:
x_curved = np.hstack((x, x**2))
lm_curved2 = LinearRegression(x_curved, g_curved)
y_curved2 = lm_curved2.predict(x_curved)

plt.scatter(x, g_curved);
plt.plot(x, y_curved2, color="red");

Neural networks are generic function approximators

One of the primary challenges with transforming the inputs to a linear model is that it requires some level of a priori knowledge about the type of transformation that is necessary. In the above example, we knew that a quadratic transformation was appropriate through visualization. In sophisticated, high-dimensional problems we often have no way to know what type of input transformation will work.

In order to solve this problem in a generic way, we can add a new layer to our model that contains flexible nonlinearities that can automatically learn to fit inputs. This approach is known as a Neural Network (NN) and is also sometimes called a Multilayer Perceptron (MLP).

Universal approximator...


Again, assume that our input matrix $\mathbf{X} \in \mathbb{R}^{N \times F}$. We now have two weight matrices: a hidden weight matrix for our nonlinearities and a visible weight matrix for our final linear regression. $\mathbf{W}_h \in \mathbb{R}^{F+1 \times M}$ and $\mathbf{W}_v \in \mathbb{R}^{M+1 \times K}$

Equations for the forward pass (to evaluate the network): \begin{align} \mathbf{Z} = & \phi( \mathbf{W}_h \mathbf{\tilde{X}} ) \\ \mathbf{Y} = & \mathbf{W}_v \mathbf{\tilde{Z}} \end{align}

Our error function is, again, squared error: \begin{align} E = & (\mathbf{Y} - \mathbf{G})^2 \end{align}

We can find the gradient of the visible layer \begin{align} \nabla_{W_v} E = & \nabla_{Y} E \cdot \nabla_{W_v} Y \\ = & \mathbf{Z}^T 2(\mathbf{Y} - \mathbf{G}) \\ = & \phi(\mathbf{W}_h \mathbf{\tilde{X}})^T 2(\mathbf{Y} - \mathbf{G}) \end{align}

and the gradient of our hidden layer \begin{align} \nabla_{W_h} E = & \nabla_{Y} E \cdot \nabla_{H_v} Y \\ % = & \mathbf{\tilde{X}}^T 2(\mathbf{Y} - \mathbf{G})^T \phi^\prime(\mathbf{W}_h \mathbf{\tilde{X}}) \end{align}

We cannot, however, set the gradient to zero like we did for linear regression.

This is because optimizing a neural network is not a convex problem. There may be many optimal solutions, called local minima.

To see this, notice that there are symmetries for each hidden unit.

In [9]:
def weight_init(size):
    """Initialize NN weight matrices, Lecun fast backprop
    return np.random.uniform(-np.sqrt(3.0 / size[0]),
                np.sqrt(3.0 / size[0]), size=size)
In [16]:
class ForwardNet(object):
    def __init__(self, x, g, nh,
                 phi_prime=lambda v: 1.0 - np.tanh(v)**2,
        """Feedforward neural network with two fully-connected layers.
        ni = x.shape[1] # num inputs
        no = g.shape[1] # num outputs
        self.phi = phi  # transfer func
        self.phi_prime = phi_prime # grad of trans func

        # initialize hidden weights
        self.hw = weight_init((ni+1, nh))
        # initialize visible weights
        self.vw = weight_init((nh+1, no))
        self.train(x, g, **kwargs)

    def gradient(self, x, g):
        """Compute MSE and error gradients
        # forward pass
        x1 = bias(x)
        h = x1 @ self.hw
        z1 = bias(self.phi(h))
        z_prime = self.phi_prime(h)
        y = z1 @ self.vw
        # error components
        e = y - g
        delta = 2.0 * (y - g) / e.size

        # visible layer gradient
        vg = z1.T @ delta

        # error backprop through visible layer
        delta = delta @ self.vw[:-1].T * z_prime

        # hidden layer gradient
        hg = x1.T @ delta

        # error, hidden grad, visible grad
        return np.mean(e**2), hg, vg
    def train(self, x, g,
              learning_rate=0.1, maxiter=5000, tol=1.0e-5):
        """Steepest descent
        i = 0
        prev_error = np.inf
        while True:
            error, hg, vg = self.gradient(x, g)
            if (i % 100) == 0:
                print(i, error)

            if np.abs(prev_error - error) < tol:
                print("reached tol")
            if i > maxiter:
                print("reached iter")

            self.hw -= learning_rate * hg
            self.vw -= learning_rate * vg
            prev_error = error
            i += 1

    def predict(self, x):
        """Compute predictions for inputs x
        h = x @ self.hw[:-1] + self.hw[-1]
        z = self.phi(h)
        y = z @ self.vw[:-1] + self.vw[-1]
        return y
    def eval_hidden(self, x):
        """Generate outputs of hidden units.
        h = x @ self.hw[:-1] + self.hw[-1]
        z = self.phi(h)
        return z

Our quadric example again...

In [17]:
nn_curved = ForwardNet(x, g_curved, nh=5, learning_rate=0.1)
y_curved_nn = nn_curved.predict(x)

plt.scatter(x, g_curved);
plt.plot(x, y_curved_nn, color="red");
0 1.2723877003432735
100 0.1112930946913207
200 0.04307358497418107
300 0.028289726612308108
400 0.022211492667359445
500 0.01915311024513723
600 0.017435331235376474
reached tol
In [18]:
z_curved_nn = nn_curved.eval_hidden(x)
plt.plot(x, z_curved_nn);
plt.xlabel("Input (x)")
plt.ylabel("Hidden unit response (z)");

Something a little more sophisticated

In [19]:
g_ripple = g + 0.5 * np.sin(4*x)
g_ripple -= g.mean(axis=0)
g_ripple /= g.std(axis=0)

nn_ripple = ForwardNet(x, g_ripple, nh=10, learning_rate=0.1)
y_ripple = nn_ripple.predict(x)

plt.scatter(x, g_ripple);
plt.plot(x, y_ripple, color="red");
0 1.6171976294298265
100 0.11186027633140838
200 0.11024823513110617
300 0.1089209635292726
400 0.10759353403007871
500 0.10608882308015342
600 0.10427279235774833
700 0.10201478818278634
800 0.09916321157636462
900 0.09553540684355748
1000 0.09093770500961687
1100 0.08523943368063996
1200 0.07847174842907936
1300 0.07084730964255596
1400 0.06268887575174613
1500 0.05437468447478249
1600 0.04632270787501674
1700 0.03895016204410438
1800 0.03257916637565257
1900 0.027351673446375374
2000 0.023225231994035086
2100 0.020036337854225216
2200 0.017577306849734263
2300 0.015652417040407568
2400 0.014105440873672801
2500 0.012824823826253638
2600 0.011736725831783376
reached tol
In [20]:
z_ripple = nn_ripple.eval_hidden(x)
plt.plot(x, z_ripple);
plt.xlabel("Input (x)")
plt.ylabel("Hidden unit response (z)");

Alternative optimization techniques

Steepest descent has a lot of problems...

Batch / offline: full gradient

Stochastic / online: one sample at a time

Mini-batch stochastic: small mini-batch of samples for each update

Stochastic Gradient Descent

Update weights for each observation, or a small mini-batch of them

Can avoid local minima

works well if cannot fit data into memory or are continually updating model

tends to require fewer passes over all the data

but is noisy (stochastic)

and performs lots of small and expensive gradient evaluations

In [26]:
class ForwardNetSGD(ForwardNet):
    """Feedforward network using Stochastic Gradient Descent
    def train(self, x, g, batch_size=10,
              learning_rate=0.1, maxiter=5000, tol=1.0e-5):
        i = 0
        prev_error = np.mean((self.predict(x) - g)**2)

        while True:
            if i > maxiter:
                print("reached iter")
            for start in range(0, x.shape[0]-1, batch_size):
                end = min(start+batch_size, x.shape[0]-1)                
                _, hg, vg = self.gradient(x[start:end], g[start:end])
                self.hw -= learning_rate * hg
                self.vw -= learning_rate * vg

            error = np.mean((self.predict(x) - g)**2)
            if (i % 100) == 0:
                print(i, error)

            if np.abs(prev_error - error) < tol:
                print("reached tol")
            prev_error = error
            i += 1
In [27]:
x_scale = 1.0 * x + 2.0

g = 20.0*x_scale + np.random.normal(size=(ns,1))
g -= g.mean(axis=0)
g /= g.std(axis=0)
g_ripple = g + 0.5 * np.sin(4*x_scale)
g_ripple -= g.mean(axis=0)
g_ripple /= g.std(axis=0)

nn_ripple_sgd = ForwardNetSGD(x_scale, g_ripple, nh=10, learning_rate=0.01)
y_ripple_sgd = nn_ripple_sgd.predict(x_scale)

plt.scatter(x_scale, g_ripple);
plt.plot(x_scale, y_ripple_sgd, color="red");
0 1.59381196409321
100 0.13399162650915225
200 0.12721549764481888
300 0.12500571258241205
400 0.12345153113444972
500 0.12195269061012654
600 0.12042761244416149
700 0.1188034098097414
800 0.11695818995401604
900 0.11473273876224432
1000 0.11212174327286119
1100 0.10931380706517672
1200 0.10633119396964195
1300 0.10311223698659912
1400 0.09965964683605268
1500 0.09601783521233984
1600 0.09223883807715462
1700 0.08837034938406786
1800 0.08445184117778629
1900 0.08051310142340302
2000 0.07657667624604574
2100 0.07266380633954346
2200 0.06879864267526976
2300 0.06500370221353298
2400 0.06128314114772787
2500 0.0576004238685464
2600 0.053889210235500216
2700 0.050176934315300174
2800 0.04665651993392122
2900 0.043443541333521976
3000 0.04050268785972002
3100 0.03777499396521457
3200 0.03522088643077816
3300 0.03281362017949456
3400 0.030532830620618352
3500 0.028362322238931977
3600 0.026287711389304334
3700 0.024294650456415066
3800 0.02237228425429318
3900 0.020520069466889012
4000 0.01874927885359521
4100 0.017077298439337075
4200 0.015520881364415977
4300 0.014092478005196212
4400 0.01279923912977919
4500 0.011643205005571073
reached tol

Conjugate gradients

Uses full gradient, assumes error surface can be approximated by quadratic function in a local neighborhood

In [28]:
import scipy.optimize as spopt

class ForwardNetCG(ForwardNet):
    def train(self, x, g,
              maxiter=500, tol=1.0e-6):

        def error_func(v):
            self.hw.flat[...] = v[:self.hw.size]
            self.vw.flat[...] = v[self.hw.size:]
            return np.mean((self.predict(x) - g)**2)
        def grad_func(v):
            self.hw.flat[...] = v[:self.hw.size]
            self.vw.flat[...] = v[self.hw.size:]
            _, hg, vg = self.gradient(x, g)
            return np.concatenate((hg.flatten(), vg.flatten()))
        def callback(v):
            if (callback.i % 100) == 0:
                self.hw.flat[...] = v[:self.hw.size]
                self.vw.flat[...] = v[self.hw.size:]
                print(callback.i, np.mean((self.predict(x) - g)**2))
            callback.i += 1
        callback.i = 0

        method = "BFGS"
        options = {"maxiter": maxiter}
        w = np.concatenate((self.hw.flat, self.vw.flat))
        optres = spopt.minimize(fun=error_func, method=method,
                x0=w, tol=tol, jac=grad_func, options=options)#, callback=callback)

        print("error: ", np.mean((self.predict(x) - g)**2))
In [29]:
nn_ripple_cg = ForwardNetCG(x, g_ripple, nh=10, maxiter=150)
y_ripple_cg = nn_ripple_cg.predict(x)

plt.scatter(x, g_ripple);
plt.plot(x, y_ripple_cg, color="red");
      fun: 0.00246915745731666
 hess_inv: array([[ 5.0327e+01,  2.8182e-02,  5.4347e+01,  9.7661e+00, ...,
         3.5631e+00, -3.0847e+01, -4.0374e+01, -9.7853e+00],
       [ 2.8182e-02,  2.7082e+01,  1.0074e+02,  2.9949e+01, ...,
         1.8472e+01, -2.9060e+01, -1.0531e+02,  4.3195e+01],
       [ 5.4347e+01,  1.0074e+02,  4.8906e+02,  1.3734e+02, ...,
         8.2194e+01, -1.6427e+02, -4.8530e+02,  1.6620e+02],
       [ 9.7661e+00,  2.9949e+01,  1.3734e+02,  4.7224e+01, ...,
         2.8770e+01, -5.3429e+01, -1.5838e+02,  5.3672e+01],
       [ 3.5631e+00,  1.8472e+01,  8.2194e+01,  2.8770e+01, ...,
         1.9211e+01, -3.2451e+01, -9.8318e+01,  3.4489e+01],
       [-3.0847e+01, -2.9060e+01, -1.6427e+02, -5.3429e+01, ...,
        -3.2451e+01,  7.2523e+01,  1.8494e+02, -5.1743e+01],
       [-4.0374e+01, -1.0531e+02, -4.8530e+02, -1.5838e+02, ...,
        -9.8318e+01,  1.8494e+02,  5.5171e+02, -1.8725e+02],
       [-9.7853e+00,  4.3195e+01,  1.6620e+02,  5.3672e+01, ...,
         3.4489e+01, -5.1743e+01, -1.8725e+02,  7.9582e+01]])
      jac: array([-9.7644e-04,  2.4594e-04, -5.2496e-06, -4.4166e-04,  5.0041e-05,
       -1.5085e-04, -6.5780e-04,  2.2680e-04, -1.3022e-04, -5.2381e-04,
       -9.3809e-04,  1.4269e-04,  6.4173e-06, -3.6616e-04,  1.9276e-04,
        2.1294e-04,  1.3697e-03,  1.3742e-04,  3.4747e-05,  3.0874e-04,
       -1.2928e-05, -1.7463e-04,  2.3611e-04,  1.0086e-04, -3.2683e-04,
       -1.9041e-04,  2.3476e-04,  5.3742e-05,  3.8784e-04, -2.9698e-04,
  message: 'Maximum number of iterations has been exceeded.'
     nfev: 155
      nit: 150
     njev: 155
   status: 1
  success: False
        x: array([-2.0947, -0.0208,  2.9069,  1.1244, -1.2106, -1.035 ,  1.5862,
       -0.0281,  1.7245, -1.6365,  2.2363, -0.9213,  6.8374, -0.9267,
       -1.1209, -0.1256,  0.7542,  0.2795,  2.2306, -3.0978, -2.6676,
        1.3533, -2.8993, -1.8429,  1.963 ,  2.1612,  4.7253,  0.6161,
       -1.3654, -3.2126,  2.4249])
error:  0.00246915745731666
In [ ]: