Suppose we have scaled the inputs for our one parameter linear regression problem and ...

x = -0.05             # our input variable
y = -0.05             # our output variable
w = 1.00              # the actual parameter
w_hat = 0.90          # our current estimate of the parameter
learning_rate = 0.1

a) If y_hat = w_hat * x, what is the value of the mean squared error (mse) loss function for this example?

   y_hat = w_hat * x
         = 0.9 * -0.05
         = -0.045

   ... so ...

   mse = (y - y_hat)**2
       = (-0.05 - (-0.045))**2
       = (-0.005)**2
       = 0.000025

b) What is the gradient of the mean squared error loss with respect to the weight estimate w_hat? Don't forget to use the chain rule:

   gradient = (partial derivative of loss with respect to activation)
            * (partial derivative of activation with respect to product)
            * (partial derivative of product with respect to weight)

   gradient(loss, w_hat) = gradient(loss, activation) * gradient(activation, product) * gradient(product, weight)
                         = (2 * (y_hat - y)) * 1 * x
                         = (2 * (-0.045 - (-0.05))) * 1 * (-0.05)
                         = (2 * 0.005) * 1 * (-0.05)
                         = 0.01 * 1 * (-0.05)
                         = -0.0005

c) What is the updated estimate of w_hat? Don't forget that we are using gradient descent; i.e. new_weight = old_weight - learning_rate * gradient.

   new_weight = old_weight - learning_rate * gradient(loss, w_hat)
              = 0.9 - 0.1 * (-0.0005)
              = 0.9 + 0.00005
              = 0.90005

d) What is the value of the mean squared error loss function for this example, after updating the weight?

   y_hat = w_hat * x
         = 0.90005 * -0.05
         = -0.0450025

   ... so ...

   mse = (y - y_hat)**2
       = (-0.05 - (-0.0450025))**2
       = (-0.0049975)**2
       = 0.00002497500625

e) Has "learning" reduced the loss function?

   Yes: the mean squared error has been reduced from 0.00002500000000 to 0.00002497500625, reducing the error by 0.00000002499375 [a 0.099975% reduction in error, almost a tenth of one percent].
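As a quick sanity check (not part of the original exercise), the whole calculation above can be reproduced in a few lines of Python. The variable names mirror the worksheet; the values in the comments are the exact decimal results worked out above, which Python's floats will match to roughly 15 significant digits.

# Reproduce the single-example, single-step worked calculation above.
x = -0.05             # input
y = -0.05             # target
w_hat = 0.90          # current estimate of the parameter
learning_rate = 0.1

# a) forward pass and mse loss for this one example
y_hat = w_hat * x                        # -0.045
mse_before = (y - y_hat) ** 2            # 0.000025

# b) gradient of the loss with respect to w_hat, via the chain rule;
#    the activation here is the identity, so its derivative is 1
gradient = (2 * (y_hat - y)) * 1 * x     # -0.0005

# c) one gradient descent update
w_hat = w_hat - learning_rate * gradient # 0.90005

# d) recompute the loss with the updated weight
y_hat = w_hat * x                        # -0.0450025
mse_after = (y - y_hat) ** 2             # 0.00002497500625

# e) confirm that learning reduced the loss
print(mse_before, mse_after, mse_before - mse_after)

Running this prints values agreeing (up to floating-point rounding in the last digits) with the hand-worked answers: the loss drops by about 2.499e-08, the 0.099975% reduction computed in part e).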