[Ex]
Continuing from the preceding article about the cost function, here we will use gradient descent to minimize that cost function, using just two parameters θ0 and θ1. Gradient descent is a more general algorithm, though: it is not limited to linear regression.
Outline:
- start with some θ0 and θ1
- keep changing θ0 and θ1 to reduce J(θ0,θ1) until we end up at a minimum
[Def]
repeat until convergence {
$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) \qquad \text{(for } j = 0 \text{ and } j = 1\text{)}$$
}
α : the learning rate; it controls how big a step we take on each gradient descent update.
Each step of gradient descent uses ALL the training examples.
Notice: we need to update these parameters "simultaneously"!
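To make the "simultaneous" update concrete, here is a minimal Python sketch (my own illustration, not code from the course): both partial derivatives are evaluated at the old values of θ0 and θ1, and only then are the parameters overwritten. The gradient callables and the toy cost in the usage example are assumptions for demonstration.

```python
# One simultaneous gradient descent step for two parameters.
# d_theta0 / d_theta1 are callables returning dJ/dtheta0 and dJ/dtheta1
# evaluated at the current (theta0, theta1).
def gradient_step(theta0, theta1, alpha, d_theta0, d_theta1):
    # Compute both partial derivatives from the OLD parameter values...
    temp0 = theta0 - alpha * d_theta0(theta0, theta1)
    temp1 = theta1 - alpha * d_theta1(theta0, theta1)
    # ...and only then overwrite both parameters (simultaneous update).
    return temp0, temp1

# Toy usage with J(theta0, theta1) = theta0**2 + theta1**2 (gradients 2*theta0 and 2*theta1):
t0, t1 = 3.0, -2.0
for _ in range(100):
    t0, t1 = gradient_step(t0, t1, alpha=0.1,
                           d_theta0=lambda a, b: 2 * a,
                           d_theta1=lambda a, b: 2 * b)
print(t0, t1)  # both values approach 0, the minimum of this toy cost
```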
[Ex]
Back to our example: we now keep only one parameter, θ1, to make the algorithm simpler.
[Plot 1]
Assuming this is our cost function: if θ1 starts from the blue point, the slope there is positive, so the gradient descent update decreases θ1 and moves it toward the green point (the minimum); how fast it moves depends on α. In contrast, if θ1 starts from the red point, the slope is negative, so θ1 increases and also moves toward the green point.
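As a quick numerical check (a toy example of my own, not one of the plots above): take J(θ1) = (θ1 − 1)², whose minimum is at θ1 = 1 and whose slope is 2(θ1 − 1). Starting on either side of the minimum, the update θ1 := θ1 − α · slope moves θ1 toward 1.

```python
# Toy one-parameter cost: J(theta1) = (theta1 - 1)**2, minimum at theta1 = 1.
def slope(theta1):
    return 2 * (theta1 - 1)  # dJ/dtheta1

alpha = 0.3
for start in (4.0, -2.0):    # start to the right of the minimum, then to the left
    theta1 = start
    for _ in range(20):
        # positive slope -> theta1 decreases, negative slope -> theta1 increases
        theta1 = theta1 - alpha * slope(theta1)
    print(start, "->", round(theta1, 4))  # both runs end up near 1.0
```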
The moving speed depends on α: a larger α generally converges faster, but if α is too big, each update can overshoot the minimum and gradient descent will diverge.
[Plot 2]
On the contrary, if α is too small, gradient descent takes tiny steps and needs a long time to converge.
[Plot 3]
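A small numerical illustration of both effects (my own sketch, reusing the toy cost J(θ1) = (θ1 − 1)² from above): with a very small α the iterate crawls toward the minimum, while with a too-large α every step overshoots further and θ1 blows up.

```python
# Effect of the learning rate on the toy cost J(theta1) = (theta1 - 1)**2.
def run(alpha, steps=15, start=4.0):
    theta1 = start
    for _ in range(steps):
        theta1 = theta1 - alpha * 2 * (theta1 - 1)  # theta1 := theta1 - alpha * dJ/dtheta1
    return theta1

print(run(alpha=0.01))  # too small: after 15 steps theta1 is still far from 1 (slow convergence)
print(run(alpha=0.5))   # reasonable: reaches the minimum quickly (for this quadratic, in one step)
print(run(alpha=1.2))   # too big: each step overshoots more than the last, theta1 diverges
```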
Here is another property of gradient descent: if our initial θ1 is already at a local optimum, the update leaves θ1 unchanged, because the slope there is zero, so the update term α·∂J/∂θ1 vanishes.
[Plot 4]
So, as we approach a local minimum, gradient descent automatically takes smaller steps, because the slope (and hence the update term α·∂J/∂θ1) shrinks; this is why we don't need to decrease α over time.
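To see this numerically (again my own toy cost J(θ1) = (θ1 − 1)² with a fixed α), the magnitude of each update, |α · dJ/dθ1|, shrinks on its own as θ1 approaches the minimum, even though α never changes:

```python
# With a FIXED alpha the step size shrinks automatically,
# because the slope dJ/dtheta1 = 2 * (theta1 - 1) shrinks near the minimum.
alpha, theta1 = 0.1, 4.0
for i in range(5):
    step = alpha * 2 * (theta1 - 1)
    theta1 -= step
    print(f"iteration {i}: step size = {abs(step):.4f}, theta1 = {theta1:.4f}")
# Printed step sizes decrease: 0.6000, 0.4800, 0.3840, 0.3072, 0.2458
```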
[For linear regression]
[Def]
$$\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) = \frac{\partial}{\partial \theta_j} \, \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$
repeat until convergence: {
$$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$$
$$\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( \left( h_\theta(x^{(i)}) - y^{(i)} \right) \cdot x^{(i)} \right)$$
}
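The two update rules above translate almost line for line into code. Here is a minimal Python sketch of batch gradient descent for one-variable linear regression with hθ(x) = θ0 + θ1·x (my own illustration; the toy data, learning rate, and iteration count are made up for the example). Each pass over the loop uses ALL m training examples, matching the note in the [Def] section.

```python
# Batch gradient descent for one-variable linear regression.
# Hypothesis: h(x) = theta0 + theta1 * x
# Cost:       J(theta0, theta1) = (1 / (2m)) * sum((h(x_i) - y_i)**2)
def gradient_descent(xs, ys, alpha=0.01, iterations=1000):
    theta0, theta1 = 0.0, 0.0
    m = len(xs)
    for _ in range(iterations):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m                             # (1/m) * sum(h(x_i) - y_i)
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m  # (1/m) * sum((h(x_i) - y_i) * x_i)
        # Simultaneous update of both parameters.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

# Toy data lying on the line y = 2x + 1; gradient descent should recover theta0 ~ 1, theta1 ~ 2.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
print(gradient_descent(xs, ys, alpha=0.05, iterations=5000))
```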