
2/26/17

Batch gradient descent

[Ex]
Continuing from the preceding article about the cost function, here we use gradient descent to minimize that cost function, with just two parameters θ0 and θ1. Gradient descent is a more general algorithm, not limited to linear regression.

Outline:

  • start with some θ0 and θ1
  • keep changing θ0 and θ1 to reduce J(θ0, θ1), until we end up at a minimum

[Def]
repeat until convergence {
    $\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$        (for j = 0 and j = 1)
}

α : the learning rate; it controls how big a step we take on each descent update.
Each step of gradient descent uses ALL the training examples.

Notice: we need to update these parameters "simultaneously"! That is, compute both new values from the old θ0 and θ1 before overwriting either one.
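To make "simultaneously" concrete, here is a minimal Python sketch of a single update step (the helper names dJ_dtheta0 and dJ_dtheta1 are placeholders I am assuming for the two partial derivatives; they are not defined in this article):

def gradient_descent_step(theta0, theta1, alpha, dJ_dtheta0, dJ_dtheta1):
    # Evaluate both partial derivatives at the OLD (theta0, theta1) first...
    temp0 = theta0 - alpha * dJ_dtheta0(theta0, theta1)
    temp1 = theta1 - alpha * dJ_dtheta1(theta0, theta1)
    # ...and only then overwrite both parameters: this is the simultaneous update.
    return temp0, temp1

Updating θ0 first and then reusing the already-updated θ0 inside the θ1 update would be a slightly different algorithm, which is exactly what the simultaneous update avoids.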

[Ex]
Back to our example: we now keep only one parameter, θ1, to make the algorithm simpler to visualize.
[Plot 1]


Assume this is our cost function. If θ1 starts from the blue point, the slope there is positive, so the gradient descent update decreases θ1 and moves it toward the green point (the minimum); how fast it moves depends on α. In contrast, if θ1 starts from the red point, the slope there is negative, so θ1 increases and also moves toward the green point.

The moving speed depends on α: a larger α converges faster, but if α is too large, the updates overshoot the minimum and the result diverges.
[Plot 2]
Conversely, if α is too small, gradient descent takes too long to converge.
[Plot 3]
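To see both failure modes numerically, here is a tiny sketch on a made-up one-parameter cost J(θ1) = θ1², whose derivative is 2θ1 and whose minimum is at θ1 = 0 (this toy cost and the α values are my own illustration, not taken from the plots above):

def run(alpha, theta1=5.0, steps=20):
    # Toy cost J(theta1) = theta1**2, so dJ/dtheta1 = 2*theta1, minimum at 0.
    for _ in range(steps):
        theta1 = theta1 - alpha * 2 * theta1   # gradient descent update
    return theta1

print(run(alpha=0.1))    # reasonable alpha: ends up close to 0
print(run(alpha=0.001))  # too small: after 20 steps it has barely moved
print(run(alpha=1.5))    # too big: |theta1| grows every step, i.e. it diverges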

Here is another property of gradient descent: if our initial θ1 is already at a local optimum, gradient descent leaves θ1 unchanged, because the slope there is (essentially) zero.
[Plot 4]
So, as we approach a local minimum, gradient descent automatically takes smaller steps, because the slope itself shrinks; this is why we don't need to decrease α over time.
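The same toy cost shows this automatic shrinking: with α held fixed, the step α · dJ/dθ1 gets smaller on its own because the slope gets smaller near the minimum (again my own sketch, not the article's plot):

theta1, alpha = 5.0, 0.1
for i in range(5):
    step = alpha * 2 * theta1              # alpha is fixed; the slope is not
    theta1 = theta1 - step
    print(i, round(step, 4), round(theta1, 4))
# The printed steps are 1.0, 0.8, 0.64, 0.512, 0.4096 -- shrinking even though alpha never changes.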

[For linear regression]
[Def]
$$\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) = \frac{\partial}{\partial \theta_j} \, \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$
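Carrying out this differentiation, with the linear hypothesis $h_\theta(x) = \theta_0 + \theta_1 x$ from the preceding article (so $\partial h_\theta / \partial \theta_0 = 1$ and $\partial h_\theta / \partial \theta_1 = x$), the chain rule gives

$$\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right), \qquad \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}$$

and plugging these into the general update rule yields: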


repeat until convergence: {
$$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$$
$$\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}$$
}
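As a minimal runnable sketch of these two update rules (plain Python; the data set, variable names, and the particular α and iteration count below are my own choices for illustration):

def batch_gradient_descent(xs, ys, alpha=0.05, iterations=5000):
    # h(x) = theta0 + theta1 * x;  J = (1/(2m)) * sum((h(x_i) - y_i)**2)
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        # Every update sums over ALL m training examples -- the "batch" part.
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Simultaneous update: both thetas are computed from the old values.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

# Made-up points lying near y = 1 + 2x; the result should approach (1, 2).
xs = [0, 1, 2, 3, 4]
ys = [1.1, 2.9, 5.2, 7.1, 8.8]
print(batch_gradient_descent(xs, ys))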





