
2/27/17

Gradient descent for multiple variables

Continuing from the preceding example:

[Ex]
Size in feet² (x_1) | # bedrooms (x_2) | # floors (x_3) | Age (x_4) | Price in $1000s (y)
2104 | 5 | 1 | 45 | 460
1416 | 3 | 2 | 40 | 232
1534 | 3 | 2 | 30 | 315
852  | 2 | 1 | 36 | 178
...  | ... | ... | ... | ...


Notation:

  • n = number of input variables (features)
  • m = number of training examples
  • x^{(i)} = input variables of the i-th training example
  • x^{(i)}_j = value of input variable j in the i-th training example
Hypothesis: h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \theta_4 x_4

For convenience of notation, define x_0 = 1 (that is, x^{(i)}_0 = 1 for every training example), so the hypothesis can be rewritten as:


h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \theta_4 x_4
            = \theta^T x
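
As a side note (not part of the original notes), here is a minimal NumPy sketch of the vectorized form h_θ(x) = θ^T x, using the first row of the table above; the parameter values are arbitrary placeholders.

```python
import numpy as np

# First training example from the table: size 2104, 5 bedrooms, 1 floor, 45 years old.
# x_0 = 1 is prepended so that theta_0 acts as the intercept term.
x = np.array([1.0, 2104.0, 5.0, 1.0, 45.0])

# Arbitrary placeholder parameters, only to show the computation.
theta = np.array([80.0, 0.1, 1.0, 5.0, -0.5])

h = theta @ x   # h_theta(x) = theta^T x
print(h)        # predicted price in $1000s for this example
```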

So, the gradient descent update rule for multiple variables is defined as below:

[Def]
repeat until convergence: {
    \theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}_j        (for j := 0 ... n)
}
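
A minimal NumPy sketch of this update rule (variable names are my own; it assumes X is the m×(n+1) design matrix that already contains the x_0 = 1 column):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """Batch gradient descent for multivariate linear regression.

    X : (m, n+1) design matrix whose first column is all ones (x_0 = 1)
    y : (m,) vector of targets
    """
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(num_iters):
        errors = X @ theta - y            # h_theta(x^(i)) - y^(i) for every example
        gradient = (X.T @ errors) / m     # (1/m) * sum_i errors_i * x_j^(i), for all j at once
        theta = theta - alpha * gradient  # simultaneous update of every theta_j
    return theta
```

Because every θ_j is computed from the same `errors` vector before `theta` is overwritten, the update is simultaneous, as required below.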

2/26/17

Batch gradient descent

[Ex]
Continuing from the preceding article about the cost function, here we will use gradient descent to minimize that cost function, using just two parameters θ_0 and θ_1. It is a more general algorithm, though, and is used well beyond linear regression.

Outline:

  • start with some θ_0 and θ_1
  • keep changing θ_0 and θ_1 to reduce J(θ_0, θ_1) until we end up at a minimum

[Def]
repeat until convergence {
    \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)        (for j = 0 and j = 1)
}

α : the learning rate, which controls how big a step we take on each descent update.
Each step of gradient descent uses ALL the training examples; this is what "batch" refers to.

Notice: we need to update these parameters "simultaneously"!
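
To make "simultaneously" concrete, here is a tiny runnable sketch on a toy cost J(θ_0, θ_1) = θ_0² + θ_1² (chosen by me only for illustration; its partial derivatives are 2θ_0 and 2θ_1). Both derivatives are evaluated with the old parameter values before either parameter is overwritten.

```python
# Toy cost J(theta0, theta1) = theta0**2 + theta1**2, minimum at (0, 0).
theta0, theta1 = 3.0, -2.0
alpha = 0.1

for _ in range(100):
    # Evaluate both partial derivatives at the *old* (theta0, theta1) ...
    temp0 = theta0 - alpha * (2 * theta0)
    temp1 = theta1 - alpha * (2 * theta1)
    # ... and only then overwrite the parameters together.
    theta0, theta1 = temp0, temp1

print(theta0, theta1)   # both values approach 0
```

Updating θ_0 first and then using the new θ_0 while computing the θ_1 update would mix old and new values, which is not the algorithm defined above.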

[Ex]
Back to our example: we now keep only one parameter, θ_1, to make the algorithm simpler.
[Plot 1]


Assume this is our cost function. If θ_1 starts from the blue point, the slope there is positive, so the gradient descent update decreases θ_1 and moves it toward the green point (the minimum); how fast it moves depends on α. In contrast, if θ_1 starts from the red point, the slope is negative, so θ_1 increases and also moves toward the green point.

The moving speed depends on α: a larger α converges faster, but if α is too large, the updates can overshoot the minimum and the result diverges.
[Plot 2]
On the contrary, if α is too small, gradient descent takes tiny steps and needs much more time to converge.
[Plot 3]

Here is another property of gradient descent: if the initial θ_1 is already at a local optimum, the update leaves θ_1 unchanged, because the slope there is (nearly) zero.
[Plot 4]
So, as we approach a local minimum, gradient descent automatically takes smaller steps, because the derivative shrinks; this is why we don't need to decrease α over time.
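
A small runnable sketch of this behaviour on a toy cost J(θ_1) = θ_1² (my own choice of cost and numbers): the step size shrinks every iteration even though α stays fixed.

```python
# One-parameter gradient descent on J(theta1) = theta1**2 (derivative 2*theta1).
theta1 = 5.0
alpha = 0.1   # fixed learning rate

for step in range(5):
    new_theta1 = theta1 - alpha * (2 * theta1)
    print(f"step {step}: theta1 = {theta1:.4f}, step size = {abs(new_theta1 - theta1):.4f}")
    theta1 = new_theta1

# The printed step sizes shrink each iteration because the slope 2*theta1 gets
# smaller as theta1 approaches the minimum at 0, while alpha never changes.
# With a too-large alpha (e.g. 1.1 here), |theta1| would grow each step and diverge.
```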

[For linear regression]
[Def]
\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) = \frac{\partial}{\partial \theta_j} \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2


Working out this partial derivative for j = 0 and j = 1 gives:

repeat until convergence: {
    \theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)
    \theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}
}
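
A minimal, runnable sketch of exactly these two updates in plain Python, on a tiny synthetic data set of my own (y = 2x, so θ_0 should approach 0 and θ_1 should approach 2):

```python
# Synthetic training data where y = 2x exactly.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]
m = len(x)

theta0, theta1 = 0.0, 0.0
alpha = 0.05

for _ in range(2000):
    errors = [theta0 + theta1 * x[i] - y[i] for i in range(m)]      # h_theta(x^(i)) - y^(i)
    grad0 = sum(errors) / m                                         # (1/m) * sum of errors
    grad1 = sum(errors[i] * x[i] for i in range(m)) / m             # (1/m) * sum of errors * x^(i)
    # simultaneous update of both parameters
    theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1

print(theta0, theta1)   # approximately (0, 2)
```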






2/19/17

Cost Function

[Ex]
Here are our training set and hypothesis:
Size in feet² (x) | Price ($) in 1000s (y)
2104 | 460
1416 | 232
1534 | 315
852  | 178
...  | ...


In this example, we want to fit a straight line to predict house prices, so we set up a simple model.
Hypothesis: h_\theta(x) = \theta_0 + \theta_1 x

It's a simple regression problem: we can choose any numbers for θ_0 and θ_1, and h_θ(x) is then the value the model predicts for the input x. As a prediction model, we want the difference between h_θ(x) and y to be small; in other words,
is this hypothesis a good fit to the data?

We can measure the accuracy of our hypothesis function with a cost function, also called the squared error function; apart from the extra factor of 1/2, it is the mean squared error (MSE) from statistics.

[Def]
J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)^2 = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2

m: the number of training examples
h_θ(x): the hypothesis function
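
As an illustration (my own sketch; the helper name compute_cost is chosen by me), this cost function can be evaluated in plain Python using the (x, y) pairs from the table above:

```python
def compute_cost(theta0, theta1, xs, ys):
    """Squared error cost J(theta0, theta1) = (1 / 2m) * sum_i (h(x_i) - y_i)^2."""
    m = len(xs)
    total = 0.0
    for x, y in zip(xs, ys):
        h = theta0 + theta1 * x      # hypothesis h_theta(x) = theta0 + theta1 * x
        total += (h - y) ** 2
    return total / (2 * m)

# Data from the table above: size in feet^2, price in $1000s.
sizes  = [2104.0, 1416.0, 1534.0, 852.0]
prices = [460.0, 232.0, 315.0, 178.0]

print(compute_cost(0.0, 0.2, sizes, prices))   # cost for the line h(x) = 0.2 * x
```

Different choices of θ_0 and θ_1 give different costs, and gradient descent (the 2/26/17 entry above) is one way to search for the pair that makes J smallest.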