IT (is) Explained
Information Technology Explained

The Maths behind Neural Networks

Alex Punnen
© All Rights Reserved


Chapter 1

Vectors, Dot Products and the Perceptron

To understand the maths or modelling of the neural network, it is best to start from the beginning when things were simple. The earliest neural network - the Rosenblatt’s Perceptron was the first to introduce the concept of using vectors and the property of dot product to split hyperplanes of input feature vectors. These are the fundamentals that are still used today. There is a lot of terms above, and the rest of the article is basically to illustrate these concepts from the very basics.

Before we talk about why Neural Network inputs and weights are modelled as vectors (and represented as matrices) let us first see what these mathematical concepts mean geometrically. This will help us in understanding the intuition of these when they are used in other contexts/ in higher dimensions.


A vector is an object that has both a magnitude and a direction. Example Force and Velocity. Both have magnitude as well as direction.

However we need to sepcify also a context where this vector lives -Vector Space. For example when we are thinking about something like Force vector, the context is usually 2D or 3D Euclidean world.



(Source: 3Blue1Brown)

The easiest way to understand the Vector is in a geometric context, say 2D or 3D cartesian coordinates, and then extrapolate it for other Vector spaces which we encounter but cannot really imgagine. This is what we will try to do here.

Matrices and Vectors

Vectors are represented as matrices.A matrix is defined to be a rectangular array of numbers. Example here is a Euclidean Vector in three-dimensional Euclidean space (or $R^{3}$). So a vector is represented as a column matrix.

\[a = \begin{bmatrix} a_{1}\\a_{2}\\a_{3}\ \end{bmatrix} = \begin{bmatrix} a_{1} & a_{2} &a_{3}\end{bmatrix}\]

Dot product

Algebraically, the dot product is the sum of the products of the corresponding entries of the two sequences of numbers.

if $\vec a = \left\langle {a_1,a_2,a_3} \right\rangle$ and $\vec b = \left\langle {b_1,b_2,b_3} \right\rangle$, then $\vec a \cdot \vec b = {a_1}{b_1} + {a_2}{b_2} + {a_3}{b_3}$

Geometrically, it is the product of the Euclidean magnitudes of the two vectors and the cosine of the angle between them

\[\vec a \cdot \vec b = \left\| {\vec a} \right\|\,\,\left\| {\vec b} \right\|\cos \theta\]


These definitions are equivalent when using Cartesian coordinates. Here is a simple proof that follows from trigonometry 8 and 9

Dot Product and Vector Alignment

If two vectors are in the same direction the dot product is positive and if they are in the opposite direction the dot product is negative. This can be visualized geometrically putting in the value of the Cosine angle.

So we could use the dot product as a way to find out if two vectors are aligned or not.

Imagine we have a problem of classifying if a leaf is healthy or not based on certain features of the leaf. For each leaf we have some feature vector set.

For any input feature vector in that vector space, if we have a weight vector, whose dot product with one feature vector of the set of input vectors of a certain class (say leaf is healthy) is positive and with the other set is negative, then that weight vector is splitting the feature vector hyper-plane into two.

In essence, we are using the weight vectors to split the hyper-plane into two distinctive sets.

So any new leaf, if we only extarct the same features into a feature vector; we can dot it with the trained weight vector and find out if it falls in healthy or deceased class.

Not all problems have their feature set which is linearly seperable. So this is a constraint of this system.

Perceptron. The first Artificial Neuron

The initial neural network - the Frank Rosenblatt’s perceptron was doing this and could only do this - that is finding a solution if and only if the input set was linearly separable.

And the first AI Winter

Note that the fact that Perceptron could not be trained for XOR or XNOR; which was demonstrated in 1969, by by Marvin Minsky and Seymour Papert in a famous paper,that showed that it was impossible for these classes of network to learn such non seperable feature sets.

This led to the first AI winter, as much of the hype generated intially by Frank Rosenblatt’s discovery became a disillusionment.


Modelling of the Perceptron

Here is how the Rosenblatt’s perceptron is modelled


Image source

Inputs are $x_1$ to $x_n$ , weights are some values that are learned $w_1$ to $w_n$. There is also a bias (b) which in above is -$\theta$

The bias can be modelled as a a weight $w_0$ connected to a dummy input $x_0$ set to 1.

If we ignore bias for a second, the output $y$ can be written as the sum of all inputs times the weights thresholded by the sum value being greater than zero or not.

\[y = 1 \text{ if } \sum_i w_i x_i \ge 0 \text{ else } y=0\]

The big blue circle is the primitive brain of the primitive neural network - the perceptron brain. Which is basically a function $\sigma$ (sigma).

This is what is called as an Activation Function in Neural Networks. We will see that later. This is a step function we use here, output is non continuous (and hence non-differentiable) and is either 1 or 0.

If the inputs are arranged as a column matrix and weights also arranged likewise then both the input and weights can be treated as vector and $\sum_i w_i x_i$ is same as the dot product $\mathbf{w}\cdot\mathbf{x}$. Hence the activation function can also be written as

\[\sigma (x) = \begin{cases} 1, & \text{if}\ \mathbf{w}\cdot\mathbf{x}+b \ge 0 \\ 0, & \text{otherwise} \\ \end{cases}\]

Note that dot product of two matrices (representing vectors), can be written as that transpose of one multiplied by another, $w \cdot x = w^Tx$

\[\sigma(w^Tx + b)= \begin{cases} 1, & \text{if}\ w^Tx + b \ge 0 \\ 0, & \text{otherwise} \\ \end{cases}\]

All three equations are the same.

The equation $w \cdot x \gt b$ defines all the points on one side of the hyperplane, and $w \cdot x \ge b$ all the points on the other side of the hyperplane and on the hyperplane itself.

This happens to be the very definition of “linear separability”

Thus, the perceptron allows us to separate our feature space in two convex half-spaces

Ref (12)

If we can calculate/train/learn the weights, then we can have a weight vector, which splits the input feature vectors to two regions by a hyperplane. This is the essence of the Perceptron, the intial artificial neuron.


Image source

In simple terms, it means that an unknown feature vector of an input set belonging to say Dogs and Cats, when done a Dot product with a trained weight vector, will fall into either the Dog space of the hyperplane, or the Cat space of the hyperplane. This is how neural networks do classifications.

Concept of Hyperplane