Introduction

A neural network in machine learning is a computational model inspired by the structure and functioning of the human brain. It consists of interconnected nodes, or “neurons,” organized into layers. It is often said to be inspired by how the human brain works, but I take that part with a grain of salt, since we do not really know how the human brain works. It is tough to make any concrete statement about the brain, so let us just begin with the structure and components of a neural network –

Types of Neural Networks (6 Most Used)

Benefits of Deep Learning

The benefits above indicate that traditional algorithms tend to plateau as the amount of data grows, while deep learning continues to improve with more data.

How a Neural Network Works

At last, we arrive at the real deal. Each neuron takes input data, processes it, and passes it on to the next neuron. That is it. Thank you for reading. Just kidding, let us take a closer look. As the data moves through the network, the connections between the nodes are strengthened or weakened depending on the patterns in the data. This allows the network to learn from the data and make predictions or decisions based on what it has learned.

Look at the image below –

Source: 3Blue1Brown

The white pixels together make up the digit 3. This is our input. The image is 28×28 in dimension, which means it has a total of 784 pixels. In my notebook, the data has a shape of (42000, 785), meaning it has 42000 rows and 785 columns. The 785 columns represent 784 features (the pixels in a 28×28 image) and 1 label.

You can find everything in the notebooks attached.

Now if you look at the notebook where the libraries are used and the MNIST dataset is downloaded directly from the Keras datasets using tf.keras.datasets.mnist.load_data(), you will find that the initial shape of the data is (60000, 28, 28). We reshape this to (60000, 784). It is visualized below –
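If you want to reproduce this step outside the notebook, here is a minimal sketch (the variable names are mine, not necessarily the ones used in the notebook) that loads MNIST through Keras, flattens each image into a 784-long vector, and scales the pixel values as described a little further below:

```python
import tensorflow as tf

# Load MNIST directly from the Keras datasets
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
print(x_train.shape)        # (60000, 28, 28)

# Flatten each 28x28 image into a 784-long vector
x_train = x_train.reshape(-1, 784)
x_test = x_test.reshape(-1, 784)
print(x_train.shape)        # (60000, 784)

# Scale pixel values from 0-255 down to 0-1 (explained just below)
x_train = x_train / 255.0
x_test = x_test / 255.0
```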

The first layer is the input layer with 784 features / pixels.

Now, our model consists of only a single hidden layer, but I could not find any animations representing that, so imagine there are only three layers (input, hidden, and output).

Each pixel has a value from 0-255 depending on the brightness of the pixel.

We convert the values to the 0–1 range by dividing them by 255.

Below is an image of the network with signals passing through the neurons, for better understanding. Pixels near white correspond to values near 1, and darker pixels to values near 0.

Process Step By Step

Forward Propagation

Forward propagation in a neural network involves passing input data through the network’s layers, where each layer applies weights, biases, and activation functions to compute the output. The process continues from the input layer to the output layer, producing the final prediction or result.

Refer to the notebook –

The input layer a[0] will have 784 units corresponding to the 784 pixels in each 28×28 input image. The hidden layer a[1] will have 10 units with ReLU activation, and finally our output layer a[2] will have 10 units, corresponding to the ten digit classes, with softmax activation.

It looks complex, but it is simple. Similar to what we did in regression (wx + b), we apply this at each layer with an activation and send the result forward to the next layer. The hidden layer uses the ReLU activation function, which is also implemented in the notebook; we will go deeper into activations in later blogs. For the output layer, we use the softmax activation. You can think of softmax as a generalization of logistic regression to multiple classes.
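As a reference for these layer sizes, here is a minimal sketch of how the parameters could be initialized. The notebook's init_params may differ in details such as the random scale, so treat this as illustrative:

```python
import numpy as np

def init_params():
    """Randomly initialize weights and biases for a 784 -> 10 -> 10 network."""
    W1 = np.random.randn(10, 784) * 0.01   # hidden layer weights, shape (10, 784)
    b1 = np.zeros((10, 1))                 # hidden layer biases,  shape (10, 1)
    W2 = np.random.randn(10, 10) * 0.01    # output layer weights, shape (10, 10)
    b2 = np.zeros((10, 1))                 # output layer biases,  shape (10, 1)
    return W1, b1, W2, b2
```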

NOTE: Comments in the notebook can be very useful.


Goal: To calculate the output of the neural network by passing the input through the network layers. (A code sketch of these steps follows the list below.)

  1. Input to Hidden Layer Calculation:
    Z[1] = W[1]X + b[1]
    • W[1]: Weights matrix for the hidden layer, shape (10, 784).
    • X (or A[0]): Input data, shape (784, m), where 784 is the number of input features (pixels) and m is the number of examples.
    • b[1]: Bias vector for the hidden layer, shape (10, 1).
    • Z[1]: The weighted input to the hidden layer, shape (10, m).
  2. Activation of the Hidden Layer:
    A[1] = ReLU(Z[1])
    • A[1]: Activations of the hidden layer, shape (10, m), using the ReLU activation function.
  3. Hidden Layer to Output Layer Calculation:
    Z[2] = W[2]A[1] + b[2]
    • W[2]: Weights matrix for the output layer, shape (10, 10).
    • A[1]: Activations from the hidden layer, shape (10, m).
    • b[2]: Bias vector for the output layer, shape (10, 1).
    • Z[2]: The weighted input to the output layer, shape (10, m).
  4. Output Layer Activation (Softmax):
    A[2] = softmax(Z[2])
    • A[2]: The output of the network, shape (10, m), where each column gives the probabilities of the input belonging to each of the 10 digit classes.
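Here is a minimal NumPy sketch of these four steps. The notebook's forward_prop follows the same idea, though small details may differ:

```python
import numpy as np

def relu(Z):
    # Element-wise ReLU: max(0, z)
    return np.maximum(0, Z)

def softmax(Z):
    # Subtract the column-wise max for numerical stability
    expZ = np.exp(Z - Z.max(axis=0, keepdims=True))
    return expZ / expZ.sum(axis=0, keepdims=True)

def forward_prop(W1, b1, W2, b2, X):
    Z1 = W1.dot(X) + b1      # (10, m)  weighted input to the hidden layer
    A1 = relu(Z1)            # (10, m)  hidden layer activations
    Z2 = W2.dot(A1) + b2     # (10, m)  weighted input to the output layer
    A2 = softmax(Z2)         # (10, m)  class probabilities
    return Z1, A1, Z2, A2
```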

Backward Propagation

Backward propagation involves computing the gradient of the loss function with respect to each weight by propagating the error backward through the network. This gradient is then used to update the weights and biases to minimize the loss.

Goal: To calculate the gradients of the loss function with respect to the weights and biases, which will be used to update the parameters. (A code sketch of these steps follows the list below.)

  1. Gradient of the Loss with Respect to Z[2]:
    dZ[2] = A[2] − Y
    • dZ[2]: The error term for the output layer, shape (10, m).
    • Y: True labels, one-hot encoded, shape (10, m).
  2. Gradient of the Loss with Respect to W[2]:
    dW[2] = (1/m) dZ[2] A[1]ᵀ
    • dW[2]: Gradient with respect to the weights of the output layer, shape (10, 10).
  3. Gradient of the Loss with Respect to b[2]:
    db[2] = (1/m) Σ dZ[2]  (sum over the m examples)
    • db[2]: Gradient with respect to the biases of the output layer, shape (10, 1).
  4. Gradient of the Loss with Respect to Z[1]:
    dZ[1] = W[2]ᵀ dZ[2] ⊙ g′[1](Z[1])
    • dZ[1]: The error term for the hidden layer, shape (10, m).
    • g′[1](Z[1]): Derivative of the ReLU activation function applied element-wise to Z[1].
  5. Gradient of the Loss with Respect to W[1]:
    dW[1] = (1/m) dZ[1] A[0]ᵀ
    • dW[1]: Gradient with respect to the weights of the hidden layer, shape (10, 784).
  6. Gradient of the Loss with Respect to b[1]:
    db[1] = (1/m) Σ dZ[1]  (sum over the m examples)
    • db[1]: Gradient with respect to the biases of the hidden layer, shape (10, 1).
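A minimal NumPy sketch of these gradients. The notebook's backward_prop may differ in small details, such as how the labels are one-hot encoded:

```python
import numpy as np

def one_hot(Y, num_classes=10):
    # Convert integer labels of shape (m,) into a one-hot matrix of shape (10, m)
    Y_one_hot = np.zeros((num_classes, Y.size))
    Y_one_hot[Y, np.arange(Y.size)] = 1
    return Y_one_hot

def relu_deriv(Z):
    # Derivative of ReLU: 1 where Z > 0, else 0
    return (Z > 0).astype(float)

def backward_prop(Z1, A1, Z2, A2, W1, W2, X, Y):
    m = X.shape[1]
    dZ2 = A2 - one_hot(Y)                             # (10, m)
    dW2 = (1 / m) * dZ2.dot(A1.T)                     # (10, 10)
    db2 = (1 / m) * dZ2.sum(axis=1, keepdims=True)    # (10, 1)
    dZ1 = W2.T.dot(dZ2) * relu_deriv(Z1)              # (10, m)
    dW1 = (1 / m) * dZ1.dot(X.T)                      # (10, 784)
    db1 = (1 / m) * dZ1.sum(axis=1, keepdims=True)    # (10, 1)
    return dW1, db1, dW2, db2
```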

Please refer to the corresponding lines in the notebook for the expressions above; writing mathematics in plain text is tough and should not be legal.

Parameter Updates

Just update each parameter by subtracting its derivative multiplied by the learning rate (don’t forget the learning rate).
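A minimal sketch of that update step, assuming the gradients from backward_prop above and a learning rate alpha:

```python
def update_params(W1, b1, W2, b2, dW1, db1, dW2, db2, alpha):
    # Gradient descent step: move each parameter against its gradient
    W1 = W1 - alpha * dW1
    b1 = b1 - alpha * db1
    W2 = W2 - alpha * dW2
    b2 = b2 - alpha * db2
    return W1, b1, W2, b2
```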

NOTE: Reminder to refer to the notebook for a better experience.

Putting It All Together (Gradient Descent)

In gradient descent, we put everything together. This function performs gradient descent to optimize the weights and biases of the network over a specified number of iterations.

  1. Initialize Parameters:
    • W1, b1, W2, b2 = init_params() initializes the weights and biases for the network.
  2. Iterate:
    • The loop runs for the specified number of iterations.
  3. Forward Propagation:
    • Z1, A1, Z2, A2 = forward_prop(W1, b1, W2, b2, X) computes the forward pass, obtaining activations for each layer.
  4. Backward Propagation:
    • dW1, db1, dW2, db2 = backward_prop(Z1, A1, Z2, A2, W1, W2, X, Y) computes the gradients of the loss with respect to the weights and biases.
  5. Update Parameters:
    • W1, b1, W2, b2 = update_params(W1, b1, W2, b2, dW1, db1, dW2, db2, alpha) updates the weights and biases using the computed gradients and the learning rate alpha.
  6. Monitor Progress:
    • Every 10 iterations, the function prints the current iteration number and the accuracy of the model on the training data.

Output: The optimized weights and biases (W1, b1, W2, b2) after the specified number of iterations. A code sketch of this training loop is shown below.
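Putting the pieces together, here is a minimal sketch of the training loop, building on the functions sketched above. The helpers get_predictions and get_accuracy are my own assumptions here for monitoring progress; the notebook's versions may differ:

```python
import numpy as np

def get_predictions(A2):
    # Pick the most probable class for each column (example)
    return np.argmax(A2, axis=0)

def get_accuracy(predictions, Y):
    return np.mean(predictions == Y)

def gradient_descent(X, Y, alpha, iterations):
    W1, b1, W2, b2 = init_params()
    for i in range(iterations):
        Z1, A1, Z2, A2 = forward_prop(W1, b1, W2, b2, X)
        dW1, db1, dW2, db2 = backward_prop(Z1, A1, Z2, A2, W1, W2, X, Y)
        W1, b1, W2, b2 = update_params(W1, b1, W2, b2,
                                       dW1, db1, dW2, db2, alpha)
        if i % 10 == 0:
            print(f"Iteration {i}, accuracy: "
                  f"{get_accuracy(get_predictions(A2), Y):.4f}")
    return W1, b1, W2, b2

# Example usage, assuming x_train of shape (60000, 784) and y_train of shape (60000,):
# W1, b1, W2, b2 = gradient_descent(x_train.T, y_train, alpha=0.10, iterations=500)
```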

Making Predictions

After training the model, it is the moment of truth: make predictions on the data and test them. You can also plot the images, as I did in the notebook, for visualization.
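A minimal sketch of predicting a single example with the trained parameters, reusing forward_prop and the reshaped x_train from the sketches above. The plotting in the notebook may look different; here I assume matplotlib:

```python
import numpy as np
import matplotlib.pyplot as plt

def make_prediction(X_single, W1, b1, W2, b2):
    # X_single has shape (784, 1); the prediction is the most probable digit
    _, _, _, A2 = forward_prop(W1, b1, W2, b2, X_single)
    return np.argmax(A2, axis=0)

# Predict and visualize one training image (index chosen arbitrarily)
index = 0
sample = x_train.T[:, index:index + 1]          # shape (784, 1)
print("Prediction:", make_prediction(sample, W1, b1, W2, b2))
print("Label:", y_train[index])
plt.imshow(sample.reshape(28, 28), cmap="gray")
plt.show()
```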

Four examples are attached in the notebook.

Neural Network Implementation from Scratch.
NOTE: The notebook containing the TensorFlow implementation is shared at the end.

Conclusion

Both notebooks are attached: one with the neural network implemented from scratch and another using TensorFlow. The same layer architecture is used in both notebooks for a better comparison.

Some terms, functions, and optimizers are not explained in this blog, as it would become too long and too much at a time, but the TensorFlow version, as you can see, is more accurate and faster because of these things. For example, the Adam optimizer provides better speed and accuracy than plain gradient descent because it adapts the learning rate as training progresses. We will talk about these later.

To get the most out of this, try creating a neural network yourself. It looks complex from the outside, but a basic neural network is much simpler than it seems.

We will keep talking about machine learning.
Until Next Time ^^

Notebook (Implementation from scratch) – https://www.kaggle.com/code/innomight/digit-recognizer-from-scratch-with-explanation

Notebook (Implementation using TensorFlow) – https://www.kaggle.com/code/innomight/digit-recognizer-using-tensorflow

Previous Blogs –
Regression Model – https://vaibhavshrivastava.com/how-linear-regression-model-actually-works/
Feature Engineering – https://vaibhavshrivastava.com/the-importance-of-feature-engineering-in-machine-learning/
Logistic Regression – https://vaibhavshrivastava.com/logistic-regression-from-basics-to-code/
Scaling Algorithms – https://vaibhavshrivastava.com/scaling-features-how-scaler-works-and-choosing-the-right-scaler/