IT (is) Explained
Information Technology Explained

Chapter 1: Vectors, Dot Products and the Perceptron

Let’s go through some basics first and then jump to the first artificial neuron, the Perceptron.

Q. Why are inputs and weights modelled as Vectors in Neural Networks?

To understand the maths or modelling of neural networks, it is best to start from the beginning, when things were simple. The earliest neural network, Rosenblatt’s perceptron, was the first to introduce the concept of using vectors and the properties of the dot product to split the space of input feature vectors with a hyperplane. These are the fundamentals that are still used today. There are a lot of terms in that sentence, and the rest of the article basically illustrates these concepts from the very basics.

Before we talk about why neural network inputs and weights are modelled as vectors (and represented as matrices), let us first see what these mathematical concepts mean geometrically. This will help us build an intuition for them when they are used in other contexts and in higher dimensions.

Q. What does a Vector mean?

A Vector is meaningless¹ unless you specify the context, the Vector Space. Assume we are thinking about something like a force vector; the context is then a 2D or 3D Euclidean world.

[Image: a vector drawn in a 2D coordinate system]

Source: 3Blue1Brown’s video on Vectors

[Image: a vector drawn in a 3D coordinate system]

¹Maths is abstract and meaningless unless you apply it to a context. This is one reason why you will get tripped up if you try to build a purely mathematical intuition about neural networks.

The easiest way to understand a vector is in a geometric context, say 2D or 3D Cartesian coordinates, and then extrapolate from there. This is what we will try to do here.

Q. What is the connection between Matrices and Vectors?

Vectors are represented as matrices. The example here is a Euclidean vector in three-dimensional Euclidean space (or $R^{3}$), represented as a column vector (usually) or as a row vector.

\[a = \begin{bmatrix} a_{1}\\a_{2}\\a_{3} \end{bmatrix} = \begin{bmatrix} a_{1} & a_{2} & a_{3}\end{bmatrix}^{T}\]
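To make the representation concrete, here is a minimal Python/NumPy sketch (the numeric values and variable names are made up for illustration) of the same vector stored as a column matrix and as a row matrix:

```python
import numpy as np

# The same vector in R^3 stored as a (3, 1) column matrix
# and as a (1, 3) row matrix (its transpose)
a_column = np.array([[1.0],
                     [2.0],
                     [3.0]])
a_row = a_column.T

print(a_column.shape)  # (3, 1)
print(a_row.shape)     # (1, 3)
```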

Q. What is a Dot product and what does it signify?

First, the dry definitions. Algebraically, the dot product is the sum of the products of the corresponding entries of the two sequences of numbers.

if $\vec a = \left\langle {a_1,a_2,a_3} \right\rangle$ and $\vec b = \left\langle {b_1,b_2,b_3} \right\rangle$, then $\vec a \cdot \vec b = {a_1}{b_1} + {a_2}{b_2} + {a_3}{b_3}$

Geometrically, it is the product of the Euclidean magnitudes of the two vectors and the cosine of the angle between them:

\[\vec a \cdot \vec b = \left\| {\vec a} \right\|\,\,\left\| {\vec b} \right\|\cos \theta\]

[Image: geometric illustration of the dot product]

These definitions are equivalent when using Cartesian coordinates. Here is a simple proof that follows from trigonometry - http://tutorial.math.lamar.edu/Classes/CalcII/DotProduct.aspx

Related Link https://sergedesmedt.github.io/MathOfNeuralNetworks/VectorMath.html#learn_vector_math_diff
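To check on a concrete example that the algebraic and geometric definitions agree, here is a small Python/NumPy sketch; the two vectors are arbitrary made-up values, and the angle is measured independently with `atan2` so the comparison is not circular:

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([2.0, 1.0])

# Algebraic definition: sum of products of corresponding entries
algebraic = a[0] * b[0] + a[1] * b[1]

# Geometric definition: |a| |b| cos(theta), with theta obtained
# from each vector's direction rather than from the dot product
theta = np.arctan2(a[1], a[0]) - np.arctan2(b[1], b[0])
geometric = np.linalg.norm(a) * np.linalg.norm(b) * np.cos(theta)

print(algebraic, geometric)  # both print (approximately) 10.0
```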

Dot Product and Vector Alignment

If two vectors point in the same general direction (the angle between them is less than 90°), the dot product is positive; if they point in opposite directions (the angle is greater than 90°), the dot product is negative; and if they are perpendicular, it is zero.

So you could use the dot product as a way to find out whether two vectors are aligned or not. That is, for any two distinct sets of input feature vectors in a vector space (say we are classifying whether a leaf is healthy or not based on certain features of the leaf), we can have a weight vector whose dot product with the feature vectors of one class (say, healthy leaves) is positive and with those of the other class is negative. In essence, we are using the weight vector to split the feature space into two distinct sets, separated by a hyperplane.
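Here is a toy sketch of that idea; the “leaf” feature values and the weight vector below are invented purely for illustration:

```python
import numpy as np

w = np.array([1.0, -1.0])              # a hypothetical trained weight vector

healthy_leaf = np.array([0.9, 0.2])    # a feature vector from one class
unhealthy_leaf = np.array([0.1, 0.8])  # a feature vector from the other class

print(np.dot(w, healthy_leaf))    #  0.7 -> positive: one side of the hyperplane
print(np.dot(w, unhealthy_leaf))  # -0.7 -> negative: the other side
```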

The first Artificial Neuron - Perceptron

The initial neural network, Rosenblatt’s perceptron, was doing this and could only do this: finding a solution if and only if the input set was linearly separable. (That constraint led to an AI winter and frosted the hopes/hype generated by the Perceptron, when it was proved that it could not solve problems like XNOR that are not linearly separable.)

Here is how Rosenblatt’s perceptron is modelled:

[Image: the classic perceptron model]

Image source https://maelfabien.github.io/deeplearning/Perceptron/#the-classic-model

Inputs are $x_1$ to $x_n$, and the weights are some values that are learned, $w_1$ to $w_n$. There is also a bias ($b$), which in the figure above is $-\theta$. The bias can be modelled as a weight $w_0$ connected to a dummy input $x_0$ set to 1.

If we ignore the bias for a second, the output $y$ can be written as the sum of all inputs times their weights, thresholded on whether that sum is greater than or equal to zero:

\[y = 1 \text{ if } \sum_i w_i x_i \ge 0 \text{ else } y=0\]

The big blue circle is the primitive brain of this primitive neural network, the perceptron brain, which is basically a function $\sigma$ (sigma).

This is what is called an Activation Function in Neural Networks; we will see more of these later. The one we use here is a step function: its output is not continuous (and hence not differentiable) and is either 1 or 0.

If the inputs are arranged as a column matrix and the weights are arranged likewise, then both the inputs and the weights can be treated as vectors, and $\sum_i w_i x_i$ is the same as the dot product $\mathbf{w}\cdot\mathbf{x}$. Hence the activation function can also be written as

\[\sigma (x) = \begin{cases} 1, & \text{if}\ \mathbf{w}\cdot\mathbf{x}+b \ge 0 \\ 0, & \text{otherwise} \\ \end{cases}\]

Note that the dot product of two vectors represented as column matrices can be written as the transpose of one multiplied by the other, $w \cdot x = w^Tx$

\[\sigma(w^Tx + b)= \begin{cases} 1, & \text{if}\ w^Tx + b \ge 0 \\ 0, & \text{otherwise} \\ \end{cases}\]

All three equations are the same.
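A minimal sketch, assuming NumPy and arbitrary example values for $x$, $w$ and $b$, showing that the explicit sum, the dot product form and the $w^Tx$ form produce the same output:

```python
import numpy as np

x = np.array([0.5, -1.0, 2.0])   # example inputs
w = np.array([1.0, 0.5, 0.25])   # example weights
b = -0.2                         # example bias

def step(z):
    # step activation: 1 if the weighted sum clears the threshold, else 0
    return 1 if z >= 0 else 0

y_sum = step(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
y_dot = step(np.dot(w, x) + b)
y_matmul = step(w.T @ x + b)     # w^T x; for 1-D arrays this equals the dot product

print(y_sum, y_dot, y_matmul)    # all three print the same value: 1
```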

The equation $w \cdot x \gt b$ defines all the points on one side of the hyperplane, and $w \cdot x \le b$ all the points on the other side of the hyperplane and on the hyperplane itself. This happens to be the very definition of “linear separability”. Thus, the perceptron allows us to separate our feature space into two convex half-spaces.

(From https://sergedesmedt.github.io/MathOfNeuralNetworks/RosenblattPerceptronArticle.html)

If we can calculate the weights, then we have a weight vector which splits the input feature vectors into two regions separated by a hyperplane.

[Image: a hyperplane separating two classes of feature vectors]

Image source https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/images/perceptron/perceptron_img1.png

In simple terms, it means that when an unknown feature vector from an input set belonging to, say, Dogs and Cats is dotted with a trained weight vector, it will fall into either the Dog side or the Cat side of the hyperplane. This is how neural networks do classification.
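For completeness, here is a sketch of the classic perceptron learning rule on a tiny made-up, linearly separable data set; the “dog”/“cat” feature values are invented for illustration, and this is the standard textbook update rule rather than code from this article:

```python
import numpy as np

# Feature vectors with a dummy input x0 = 1, so the bias is learned as w[0]
X = np.array([
    [1.0,  2.0,  3.0],   # "dog" examples -> label 1
    [1.0,  3.0,  2.5],
    [1.0, -2.0, -1.0],   # "cat" examples -> label 0
    [1.0, -1.5, -2.5],
])
t = np.array([1, 1, 0, 0])

def step(z):
    return 1 if z >= 0 else 0

w = np.zeros(3)                      # weight vector, bias included as w[0]

for epoch in range(10):              # a few passes suffice for this toy set
    for x_i, t_i in zip(X, t):
        y_i = step(np.dot(w, x_i))   # perceptron output for this example
        w += (t_i - y_i) * x_i       # update only when the prediction is wrong

print(w)                                    # the learned weight vector
print([step(np.dot(w, x_i)) for x_i in X])  # [1, 1, 0, 0]
```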
