What is back propagation?
In machine learning, back-propagation is a gradient estimation method used to train neural network models. The gradient estimate is used by the optimization algorithm to compute the network parameter updates.
- Wikipedia
Basically, it's an algorithm used to train a neural network.
We'll be referencing the above diagram throughout these notes: a network with two inputs (IQ and CGPA), one hidden layer of two perceptrons, and a single output (Package), giving 6 weights and 3 biases in total.
So in the above neural network, each arrow represents a weight and each perceptron has a bias. The back-propagation algorithm is used to find the best values of all the weights and biases so that the neural network performs as well as it can.
For this setting, we'll assume all the activation functions are linear.
| IQ  | CGPA | Package (LPA) |
|-----|------|---------------|
| 80  | 8    | 5             |
| 60  | 5    | 3             |
| 110 | 7    | 8             |
Consider the above table as our example dataset. The algorithm proceeds as follows:
1. Initialize all the weights and biases. Either make them random, or initialize all the weights to 1 and all the biases to 0 (we'll do the latter).
2. Select a random student (row). We're selecting the first row, whose package is 5.
3. Predict the Package (target column) using dot products. As the initial weights are essentially arbitrary, the result will be incorrect. Suppose the answer we got is 17.
4. Choose a loss function. We're choosing the squared error: $L = (y - \hat{y})^2$. In our case, the loss will be $(5 - 17)^2 = 144$.
5. Update the weights and biases using gradient descent, with the following formula:

   $$w_{new} = w_{old} - \eta \cdot \frac{\partial L}{\partial w}$$

   where $\eta$ = learning rate.
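To make steps 1–4 concrete, here's a minimal NumPy sketch of the forward pass and loss for the first row. The variable names (`W1`, `b1`, etc.) and the 2-2-1 layout are our own reading of the diagram, and note that with all-ones weights the actual prediction comes out to 176, not the illustrative 17 used above:

```python
import numpy as np

# Step 1: initialize all weights to 1 and all biases to 0 (2-2-1 network)
W1 = np.ones((2, 2))   # hidden-layer weights (one column per hidden perceptron)
b1 = np.zeros(2)       # hidden-layer biases
W2 = np.ones((2, 1))   # output-layer weights
b2 = np.zeros(1)       # output-layer bias

# Step 2: select the first student (IQ=80, CGPA=8, Package=5)
x = np.array([80.0, 8.0])
y = 5.0

# Step 3: forward pass with linear activations (just dot products)
O1 = x @ W1 + b1            # hidden-layer outputs O11, O12
y_hat = (O1 @ W2 + b2)[0]   # predicted package

# Step 4: squared-error loss
L = (y - y_hat) ** 2
print(y_hat, L)  # with all-ones weights, y_hat = 176.0 and the loss is huge
```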
So now the final output $\hat{y}$ depends on numerous variables.

Here, $\hat{y}$ depends upon 5 variables, i.e. $O_{11}$, $O_{12}$, $w^2_{11}$, $w^2_{21}$ & $b_{21}$:

$$\hat{y} = w^2_{11} \cdot O_{11} + w^2_{21} \cdot O_{12} + b_{21}$$

Now the weights $w^2_{11}$ & $w^2_{21}$ and the bias $b_{21}$ are simple variables; however, the results $O_{11}$ & $O_{12}$ depend on their own weights and biases. In this way, a hierarchy is formed.
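For reference, with linear activations the hidden outputs themselves are just dot products of the inputs (using our own subscript convention, since the diagram's labels aren't reproduced here):

$$O_{11} = w^1_{11} \cdot x_1 + w^1_{21} \cdot x_2 + b_{11}, \qquad O_{12} = w^1_{12} \cdot x_1 + w^1_{22} \cdot x_2 + b_{12}$$

where $x_1$ = IQ and $x_2$ = CGPA.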
During the whole algorithm, we select a random student (row), calculate its loss, and use that loss to update the weights and biases.
Suppose we have to calculate the update for the weight $w^1_{11}$ (the weight connecting IQ to the first hidden perceptron); we will have to perform the following calculation:

$$w^1_{11} = w^1_{11} - \eta \cdot \frac{\partial L}{\partial w^1_{11}}$$
Now as we already have the values of $w^1_{11}$ & $\eta$, we only have to calculate the value of $\frac{\partial L}{\partial w^1_{11}}$. This value is known as the gradient, and it's why the update rule is called gradient descent. In the algorithm, each weight and bias is updated, so in our neural network we'll calculate 9 such gradients, as we have 6 weights and 3 biases.
Now what does $\frac{\partial L}{\partial w^1_{11}}$ actually mean? This term signifies the amount of change that occurs in $L$ when we change $w^1_{11}$.
But you might notice that $L$ doesn't directly depend upon $w^1_{11}$; rather, $L$ depends upon $\hat{y}$, $\hat{y}$ depends upon $O_{11}$, and, as we already know, $O_{11}$ directly depends upon $w^1_{11}$. So $L$ indirectly depends upon $w^1_{11}$, and that is why we're using the chain rule of partial derivatives and not direct derivatives:

$$\frac{\partial L}{\partial w^1_{11}} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial O_{11}} \cdot \frac{\partial O_{11}}{\partial w^1_{11}}$$
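Here is the same chain rule in code, as a small self-contained sketch. The derivative expressions follow from the squared-error loss and linear activations assumed above; the learning rate of 1e-5 is our own choice, kept tiny because the raw IQ input is large:

```python
import numpy as np

# Same setup as the earlier sketch: all-ones weights, zero biases, first row
W1, b1 = np.ones((2, 2)), np.zeros(2)
W2, b2 = np.ones((2, 1)), np.zeros(1)
x, y = np.array([80.0, 8.0]), 5.0

O1 = x @ W1 + b1            # hidden outputs O11, O12
y_hat = (O1 @ W2 + b2)[0]   # prediction

# Chain rule: dL/dw^1_11 = dL/dy_hat * dy_hat/dO11 * dO11/dw^1_11
dL_dyhat = -2 * (y - y_hat)   # from L = (y - y_hat)^2
dyhat_dO11 = W2[0, 0]         # y_hat is linear in O11 with coefficient w^2_11
dO11_dw111 = x[0]             # O11 is linear in w^1_11 with coefficient x1 (IQ)

grad = dL_dyhat * dyhat_dO11 * dO11_dw111
W1[0, 0] -= 1e-5 * grad       # gradient-descent step for this one weight
```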
So basically, we update all the weights and biases for each row, and after we're done with the first pass, the weights and biases will predict the answer a little better than before.

However, the predictions still won't be as accurate as we want, so we'll run this process over the dataset numerous times, letting the model get a little better each time.

Each such pass of the algorithm through the whole dataset is known as an epoch. With each epoch the loss reduces, and the model gets better over time.
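Putting it all together, here is a compact sketch of the full training loop on our 3-row table. The feature scaling, learning rate, and epoch count are our own illustrative choices (the notes don't specify them); the point is simply that the printed loss shrinks epoch by epoch:

```python
import numpy as np

# Dataset from the table above: [IQ, CGPA] -> Package (LPA)
X = np.array([[80.0, 8.0], [60.0, 5.0], [110.0, 7.0]])
Y = np.array([5.0, 3.0, 8.0])
X = X / X.max(axis=0)   # scale features so fixed-step gradient descent stays stable

# 2-2-1 network: all weights 1, all biases 0, linear activations
W1, b1 = np.ones((2, 2)), np.zeros(2)
W2, b2 = np.ones((2, 1)), np.zeros(1)
eta = 0.01

for epoch in range(50):
    total_loss = 0.0
    for x, y in zip(X, Y):          # one student (row) at a time
        O1 = x @ W1 + b1            # hidden outputs O11, O12
        y_hat = (O1 @ W2 + b2)[0]   # prediction
        total_loss += (y - y_hat) ** 2

        # Gradients via the chain rule (squared-error loss, linear activations)
        dL_dyhat = -2 * (y - y_hat)
        dW2 = dL_dyhat * O1.reshape(2, 1)   # dL/dW2
        db2 = np.array([dL_dyhat])          # dL/db2
        dO1 = dL_dyhat * W2[:, 0]           # dL/dO1 (length 2)
        dW1 = np.outer(x, dO1)              # dL/dW1
        db1 = dO1                           # dL/db1

        # Gradient-descent updates for all 6 weights and 3 biases
        W1 -= eta * dW1; b1 -= eta * db1
        W2 -= eta * dW2; b2 -= eta * db2

    if epoch % 10 == 0:
        print(f"epoch {epoch}: loss = {total_loss:.3f}")
```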