
2/27/17

Gradient descent for multiple variables

Continuing from the preceding example:

[Ex]
Size in feet² (x_1) | # bedrooms (x_2) | # floors (x_3) | Age (x_4) | Price in $1000s (y)
2104 | 5 | 1 | 45 | 460
1416 | 3 | 2 | 40 | 232
1534 | 3 | 2 | 30 | 315
852  | 2 | 1 | 36 | 178
...  | ... | ... | ... | ...


Notation:

  • n = number of input variables (features)
  • m = number of training examples
  • x^{(i)} = input variables of the i-th training example
  • x^{(i)}_j = value of input variable j in the i-th training example
Hypothesis: h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \theta_4 x_4

For convenience of notation, define x_0 = 1 (that is, x^{(i)}_0 = 1 for every training example), so the hypothesis can be rewritten as:


h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \theta_4 x_4
            = \theta^T x
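
As a side note (not part of the original notes), here is a minimal NumPy sketch of the vectorized form h_θ(x) = θ^T x, using the first row of the table above; the parameter values are arbitrary placeholders.

```python
import numpy as np

# First training example from the table: size 2104, 5 bedrooms, 1 floor, 45 years old.
# x_0 = 1 is prepended so that theta_0 acts as the intercept term.
x = np.array([1.0, 2104.0, 5.0, 1.0, 45.0])

# Arbitrary placeholder parameters, only to show the computation.
theta = np.array([80.0, 0.1, 1.0, 5.0, -0.5])

h = theta @ x   # h_theta(x) = theta^T x
print(h)        # predicted price in $1000s for this example
```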

So, the gradient descent update rule for multiple variables is defined as below:

[Def]
repeat until convergence: {
    \theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}_j        (for j := 0 ... n)
}
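
A minimal NumPy sketch of this update rule (variable names are my own; it assumes X is the m×(n+1) design matrix that already contains the x_0 = 1 column):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """Batch gradient descent for multivariate linear regression.

    X : (m, n+1) design matrix whose first column is all ones (x_0 = 1)
    y : (m,) vector of targets
    """
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(num_iters):
        errors = X @ theta - y            # h_theta(x^(i)) - y^(i) for every example
        gradient = (X.T @ errors) / m     # (1/m) * sum_i errors_i * x_j^(i), for all j at once
        theta = theta - alpha * gradient  # simultaneous update of every theta_j
    return theta
```

Because every θ_j is computed from the same `errors` vector before `theta` is overwritten, the update is simultaneous, as required below.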

2/26/17

Batch gradient descent

[Ex]
Continuing from the preceding article about the cost function, here we will use gradient descent to minimize that cost function, using just two parameters θ_0 and θ_1. It is a more general algorithm, though, and is used well beyond linear regression.

Outline:

  • start with some θ_0 and θ_1
  • keep changing θ_0 and θ_1 to reduce J(θ_0, θ_1) until we end up at a minimum

[Def]
repeat until convergence {
    \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)        (for j = 0 and j = 1)
}

α : the learning rate, which controls how big a step we take on each descent update.
Each step of gradient descent uses ALL the training examples; this is what "batch" refers to.

Notice: we need to update these parameters "simultaneously"!
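
To make "simultaneously" concrete, here is a tiny runnable sketch on a toy cost J(θ_0, θ_1) = θ_0² + θ_1² (chosen by me only for illustration; its partial derivatives are 2θ_0 and 2θ_1). Both derivatives are evaluated with the old parameter values before either parameter is overwritten.

```python
# Toy cost J(theta0, theta1) = theta0**2 + theta1**2, minimum at (0, 0).
theta0, theta1 = 3.0, -2.0
alpha = 0.1

for _ in range(100):
    # Evaluate both partial derivatives at the *old* (theta0, theta1) ...
    temp0 = theta0 - alpha * (2 * theta0)
    temp1 = theta1 - alpha * (2 * theta1)
    # ... and only then overwrite the parameters together.
    theta0, theta1 = temp0, temp1

print(theta0, theta1)   # both values approach 0
```

Updating θ_0 first and then using the new θ_0 while computing the θ_1 update would mix old and new values, which is not the algorithm defined above.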

[Ex]
Back to our example: we now keep only one parameter, θ_1, to make the algorithm simpler.
[Plot 1]


Assume this is our cost function. If θ_1 starts from the blue point, the slope there is positive, so the gradient descent update decreases θ_1 and moves it toward the green point (the minimum); how fast it moves depends on α. In contrast, if θ_1 starts from the red point, the slope is negative, so θ_1 increases and also moves toward the green point.

The moving speed depends on α: a larger α converges faster, but if α is too large, the updates can overshoot the minimum and the result diverges.
[Plot 2]
On the contrary, if α is too small, gradient descent takes tiny steps and needs much more time to converge.
[Plot 3]

Here is another property of gradient descent: if the initial θ_1 is already at a local optimum, the update leaves θ_1 unchanged, because the slope there is (nearly) zero.
[Plot 4]
So, as we approach a local minimum, gradient descent automatically takes smaller steps, because the derivative shrinks; this is why we don't need to decrease α over time.
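
A small runnable sketch of this behaviour on a toy cost J(θ_1) = θ_1² (my own choice of cost and numbers): the step size shrinks every iteration even though α stays fixed.

```python
# One-parameter gradient descent on J(theta1) = theta1**2 (derivative 2*theta1).
theta1 = 5.0
alpha = 0.1   # fixed learning rate

for step in range(5):
    new_theta1 = theta1 - alpha * (2 * theta1)
    print(f"step {step}: theta1 = {theta1:.4f}, step size = {abs(new_theta1 - theta1):.4f}")
    theta1 = new_theta1

# The printed step sizes shrink each iteration because the slope 2*theta1 gets
# smaller as theta1 approaches the minimum at 0, while alpha never changes.
# With a too-large alpha (e.g. 1.1 here), |theta1| would grow each step and diverge.
```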

[For linear regression]
[Def]
\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) = \frac{\partial}{\partial \theta_j} \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2


Working out this partial derivative for j = 0 and j = 1 gives:

repeat until convergence: {
    \theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)
    \theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}
}
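
A minimal, runnable sketch of exactly these two updates in plain Python, on a tiny synthetic data set of my own (y = 2x, so θ_0 should approach 0 and θ_1 should approach 2):

```python
# Synthetic training data where y = 2x exactly.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]
m = len(x)

theta0, theta1 = 0.0, 0.0
alpha = 0.05

for _ in range(2000):
    errors = [theta0 + theta1 * x[i] - y[i] for i in range(m)]      # h_theta(x^(i)) - y^(i)
    grad0 = sum(errors) / m                                         # (1/m) * sum of errors
    grad1 = sum(errors[i] * x[i] for i in range(m)) / m             # (1/m) * sum of errors * x^(i)
    # simultaneous update of both parameters
    theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1

print(theta0, theta1)   # approximately (0, 2)
```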






2/19/17

Cost Function

[Ex]
Here are our training set and hypothesis:
Size in feet² (x) | Price ($) in 1000s (y)
2104 | 460
1416 | 232
1534 | 315
852  | 178
...  | ...


In this example, we want to fit a straight line to predict house prices, so we set up a simple model.
Hypothesis: h_\theta(x) = \theta_0 + \theta_1 x

It's a simple regression problem: we can choose any numbers for θ_0 and θ_1, and h_θ(x) is then the value the model predicts for the input x. As a prediction model, we want the difference between h_θ(x) and y to be small; in other words,
is this hypothesis a good fit to the data?

We can measure the accuracy of our hypothesis function with a cost function, also called the squared error function; apart from the extra factor of 1/2, it is the mean squared error (MSE) from statistics.

[Def]
J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)^2 = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2

m: the number of training examples
h_θ(x): the hypothesis function
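
As an illustration (my own sketch; the helper name compute_cost is chosen by me), this cost function can be evaluated in plain Python using the (x, y) pairs from the table above:

```python
def compute_cost(theta0, theta1, xs, ys):
    """Squared error cost J(theta0, theta1) = (1 / 2m) * sum_i (h(x_i) - y_i)^2."""
    m = len(xs)
    total = 0.0
    for x, y in zip(xs, ys):
        h = theta0 + theta1 * x      # hypothesis h_theta(x) = theta0 + theta1 * x
        total += (h - y) ** 2
    return total / (2 * m)

# Data from the table above: size in feet^2, price in $1000s.
sizes  = [2104.0, 1416.0, 1534.0, 852.0]
prices = [460.0, 232.0, 315.0, 178.0]

print(compute_cost(0.0, 0.2, sizes, prices))   # cost for the line h(x) = 0.2 * x
```

Different choices of θ_0 and θ_1 give different costs, and gradient descent (the 2/26/17 entry above) is one way to search for the pair that makes J smallest.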