3/16/17

Overfitting

Continuing the preceding example of house prices.
Here is our training data, plotted below:
[Plot 1]
If we fit a linear function $\theta_0 + \theta_1 x$ to this data, we get the plot below:
[Plot 2]

Roughly speaking, the line doesn't fit the training data very well; the model has a very strong preconception. This situation is called underfitting, or high bias.

If we use a quadratic function $\theta_0 + \theta_1 x + \theta_2 x^2$, then this function fits the data well.
[Plot 3]

And if we fit a fourth-order polynomial $\theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4$ to the data, it may look like this plot:
[Plot 4]


Although this curve fits the training data very well, it doesn't make any sense as a description of the relationship between house price and size. This problem is called overfitting, or high variance. We don't have enough data to constrain it to give us a good hypothesis.

Overfitting:
If we have too many features, the learned hypothesis may fit the training data very well, but fail to generalize to new examples.

Options:
  • Reduce number of features
    • manually select
    • model selection algorithm
By throwing away features, we also throw away some information at the same time.
  • Regularization (link: regularization)
    • Keep all the features, but reduce the magnitude/values of the parameters
    • Each feature then contributes a bit to predicting y
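To make the underfit/overfit contrast concrete, here is a small sketch with synthetic data (numpy assumed; the numbers are mine, not from the notes). A degree-4 polynomial through 5 points interpolates them exactly, so its training error is near zero even though it generalizes worse:

```python
import numpy as np

# tiny synthetic "house price" set: price grows with size but flattens out
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.0, 1.9, 2.4, 2.7, 2.8])

# degree-1 (underfit-prone) vs degree-4 (overfit-prone) polynomial fits
lin = np.polyfit(x, y, 1)
quart = np.polyfit(x, y, 4)

def train_error(coeffs):
    residuals = np.polyval(coeffs, x) - y
    return float(np.mean(residuals ** 2))

# 5 coefficients, 5 points: the quartic passes through every point,
# so its training error is ~0 -- low training error alone proves nothing
assert train_error(quart) <= train_error(lin)
```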


Regularization

Continuing the preceding example

Our fourth-order polynomial is the same as the quadratic function except for the $\theta_3$ and $\theta_4$ terms. Our goal is to minimize the cost $\dfrac {1}{2m} \displaystyle \sum _{i=1}^m \left (h_\theta (x_{i}) - y_{i} \right)^2$, so if we want to reduce the influence of $\theta_3$ and $\theta_4$, we can add penalty terms with large coefficients on those (squared) parameters:
$\dfrac {1}{2m} \displaystyle \sum _{i=1}^m \left (h_\theta (x_{i}) - y_{i} \right)^2 + 10000\, \theta_3^2 + 10000\, \theta_4^2$
Minimizing this cost forces those parameters very close to 0, so the high-order polynomial approaches the well-fitting quadratic function and is less prone to overfit.


[In linear regression]

[Def]
$$min_\theta\ \dfrac{1}{2m}\ \left[ \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2 + \lambda\ \sum_{j=1}^n \theta_j^2 \right]$$

$\lambda$ is the regularization parameter; it decides how strongly the parameters are shrunk. (If $\lambda$ is too large, every $\theta_j$ except $\theta_0$ is driven toward 0 and the model underfits.)
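As a sketch in numpy (my own helper, not part of the course notes), this cost can be written so the intercept $\theta_0$ is left out of the penalty:

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """J(theta) with an L2 penalty; theta[0] (the intercept) is not penalized."""
    m = len(y)
    errors = X @ theta - y
    penalty = lam * np.sum(theta[1:] ** 2)
    return (np.sum(errors ** 2) + penalty) / (2 * m)

# toy check: with lambda = 0 this is the ordinary squared-error cost
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 3.0, 4.0])
theta = np.array([1.0, 1.0])          # a perfect fit: y = 1 + x
assert regularized_cost(theta, X, y, 0.0) == 0.0
assert regularized_cost(theta, X, y, 6.0) == 1.0   # penalty 6*1 / (2*3) = 1
```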

[Gradient descent with regularization]
\begin{align*} & \text{Repeat}\ \lbrace \newline & \ \ \ \ \theta_0 := \theta_0 - \alpha\ \frac{1}{m}\ \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})x_0^{(i)} \newline & \ \ \ \ \theta_j := \theta_j - \alpha\ \left[ \left( \frac{1}{m}\ \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)} \right) + \frac{\lambda}{m}\theta_j \right] &\ \ \ \ \ \ \ \ \ \ j \in \lbrace 1,2...n\rbrace\newline & \rbrace \end{align*}

We can rearrange the $\theta_j$ update to get more intuition:
$$\theta_j := \theta_j(1 - \alpha\frac{\lambda}{m}) - \alpha\frac{1}{m}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)}$$

This update separates into two parts. In the first, the factor $(1 - \alpha\frac{\lambda}{m})$ is a little less than 1, so it shrinks $\theta_j$ slightly on every iteration; the second part is exactly the same as the update without regularization.
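A minimal numpy sketch of one update step (my own toy data; the convention that $\theta_0$ is not shrunk is kept):

```python
import numpy as np

def gradient_step(theta, X, y, alpha, lam):
    """One regularized gradient-descent step. The intercept theta[0] gets
    no lambda/m shrinkage; every other theta_j is pulled toward zero."""
    m = len(y)
    grad = (X.T @ (X @ theta - y)) / m
    reg = (lam / m) * theta
    reg[0] = 0.0                       # do not regularize the intercept
    return theta - alpha * (grad + reg)

# toy data from y = 1 + x; with lambda = 0 this is plain gradient descent
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 2.0, 3.0])
theta = np.zeros(2)
for _ in range(5000):
    theta = gradient_step(theta, X, y, alpha=0.1, lam=0.0)
assert np.allclose(theta, [1.0, 1.0], atol=1e-3)
```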

[Normal equation with regularization]
\begin{align*}& \theta = \left( X^TX + \lambda \cdot L \right)^{-1} X^Ty \newline& \text{where}\ \ L = \begin{bmatrix} 0 & & & & \newline & 1 & & & \newline & & 1 & & \newline & & & \ddots & \newline & & & & 1 \newline\end{bmatrix}\end{align*}

Notice that the matrix $L$ has a zero as its upper-left entry, since the intercept doesn't need regularization. The other advantage: in the m < n case, the normal equation with regularization is still invertible. As we mentioned in the post normal-equation, $X^T X$ is non-invertible (singular) in the m < n case when the regularization term is not added.
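A sketch of this equation in numpy (the helper name and toy data are mine):

```python
import numpy as np

def regularized_normal_equation(X, y, lam):
    """theta = (X^T X + lambda * L)^(-1) X^T y, with L the identity whose
    top-left entry is zeroed so the intercept theta_0 is not penalized."""
    n = X.shape[1]
    L = np.eye(n)
    L[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)

# toy data from y = 1 + x: with lambda = 0 we recover the exact fit,
# and a positive lambda shrinks the slope toward zero
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 2.0, 3.0])
assert np.allclose(regularized_normal_equation(X, y, 0.0), [1.0, 1.0])
assert regularized_normal_equation(X, y, 3.0)[1] < 1.0
```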


[In logistic regression]

[Def]
$$J(\theta) = - \frac{1}{m} \sum_{i=1}^m \large[ y^{(i)}\ \log (h_\theta (x^{(i)})) + (1 - y^{(i)})\ \log (1 - h_\theta(x^{(i)}))\large] + \frac{\lambda}{2m}\sum_{j=1}^n \theta_j^2$$

[Gradient descent with regularization]
The gradient descent update has the same form as in linear regression, but the hypothesis $h_\theta$ is different (the sigmoid of $\theta^T x$).

3/13/17

Multiclass classification - one vs. all

[Ex]
Suppose the training data looks like this plot:
[Plot 1]



We can use one-vs-all classification to handle this.
First, let the red circles be the positive class and everything else (drawn filled in purple) be negative; then we get a boundary $h_{\theta}^{(1)}(x)$ like this:
[Plot 2]

Second, let the green circles be the positive class and everything else negative; then we get the boundary $h_{\theta}^{(2)}(x)$ like this:
[Plot 3]

Finally, doing the same thing once more, we get the last boundary $h_{\theta}^{(3)}(x)$:
[Plot 4]

Concretely, for each class i we fit a classifier $h_{\theta}^{(i)}(x)$ that estimates the probability that y is class i, i.e. $P(y = i \mid x ; \theta)$.

So, to wrap up: for a k-class problem, we train k classifiers, one per class, and predict the class whose classifier outputs the highest probability.
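A sketch of the prediction step (numpy assumed; the parameter values are hypothetical, chosen only to illustrate):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def one_vs_all_predict(Theta, x):
    """Theta has one row of fitted parameters per class; predict the class
    whose classifier h_theta^(i)(x) outputs the highest probability."""
    probs = sigmoid(Theta @ x)        # P(y = i | x; theta) for each class i
    return int(np.argmax(probs))

# hypothetical parameters for 3 classes over [x0 = 1, x1, x2]
Theta = np.array([[ 2.0, -1.0, -1.0],   # class 0: small x1 and x2
                  [-2.0,  1.0, -1.0],   # class 1: large x1
                  [-2.0, -1.0,  1.0]])  # class 2: large x2
assert one_vs_all_predict(Theta, np.array([1.0, 0.0, 0.0])) == 0
assert one_vs_all_predict(Theta, np.array([1.0, 3.0, 0.0])) == 1
assert one_vs_all_predict(Theta, np.array([1.0, 0.0, 3.0])) == 2
```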


3/12/17

Logistic regression model

[Cost function in logistic regression]

[Def]
\begin{align*}& J(\theta) = \dfrac{1}{m} \sum_{i=1}^m \mathrm{Cost}(h_\theta(x^{(i)}),y^{(i)}) \newline & \mathrm{Cost}(h_\theta(x),y) = -\log(h_\theta(x)) \; & \text{if y = 1} \newline & \mathrm{Cost}(h_\theta(x),y) = -\log(1-h_\theta(x)) \; & \text{if y = 0}\end{align*}

Let's look at the plots to see this more clearly; the first plot is for y = 1.
[Plot 1]
When y = 1: if $h_{\theta}(x) = 1$, the cost is 0, but as $h_{\theta}(x) \to 0$ the cost $\rightarrow \infty$. Put another way: if the hypothesis says $h_{\theta}(x) = 0$, i.e. $P( y = 1 | x, \theta) = 0$, but in fact y = 1, we penalize the learning algorithm with a very large cost.

When y = 0, the plot is as below
[Plot 2]
Similarly, $-\log(1 - h_\theta(x))$ penalizes the learning algorithm when the prediction is opposite to the actual label.

Using the log here matters; let's take a look at this plot:
[Plot 3]
With the squared-error cost from linear regression, J is non-convex for the sigmoid hypothesis, so gradient descent can get stuck at a local minimum; with the log-based cost of logistic regression, J is convex, so gradient descent is guaranteed to find the global minimum.

Combining these two cases into one formula, we get:
[Def]
$$\mathrm{Cost}(h_\theta(x),y) = - y \; \log(h_\theta(x)) - (1 - y) \log(1 - h_\theta(x))$$
and the whole cost function is
$$J(\theta) = - \frac{1}{m} \displaystyle \sum_{i=1}^m [y^{(i)}\log (h_\theta (x^{(i)})) + (1 - y^{(i)})\log (1 - h_\theta(x^{(i)}))]$$
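A small numpy sketch of this cost (vectorized; the toy example is mine):

```python
import numpy as np

def logistic_cost(theta, X, y):
    """The cross-entropy cost J(theta) above, vectorized."""
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

# one example with y = 1: a confident correct prediction costs little,
# a confident wrong prediction costs a lot
X = np.array([[1.0]])
y = np.array([1.0])
assert logistic_cost(np.array([3.0]), X, y) < 0.1   # h ~ 0.95
assert logistic_cost(np.array([-3.0]), X, y) > 2.5  # h ~ 0.05
```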

The next step is to find the minimum of the cost function J with gradient descent, which we introduced before in gradient-descent-for-multiple-variables; note that the parameters must be updated simultaneously.

3/9/17

Classification - Binary

[Foreword]

[Ex]
Email: spam / not spam
Online transaction: fraudulent / not fraudulent
Tumor: Malignant / Benign

y $\in$ {0, 1}

0: negative class, e.g. not spam, benign tumor
1: positive class, e.g. spam, malignant tumor

[Linear regression is not good]
Assume this is the hypothesis: $h_{\theta}(x) = \theta^T x$
then we set 0.5 as the threshold, that means

$h_{\theta}(x) \geq 0.5$, predict y = 1
$h_{\theta}(x) < 0.5$, predict y = 0

Here is the schematic diagram
[Plot 1]


When our data contains only the red points, this threshold works well.
But when we add a purple point out to the side, the fitted line becomes flatter than before and gives worse predictions.
Another odd thing: when we use linear regression for a classification problem, the hypothesis may output values much larger than 1 or much smaller than 0, even though every training label y is 0 or 1.
This is why we use another method, called logistic regression.

[Logistic regression]
Because we want $0 \leq h_{\theta}(x) \leq 1$, we apply a function g to the original $h_{\theta}(x)$:

$$\mathbf{h_{\theta}(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}}$$

This formula is called the "sigmoid function" or "logistic function".
We can use 0.5 as the threshold again through the sigmoid function; let's take a look at this plot:
[Plot 2]


$h_{\theta}(x) \geq 0.5$ exactly when $z = \theta^T x \geq 0$, in which case we output y = 1; conversely, $h_{\theta}(x) < 0.5$ when $z < 0$, and we output y = 0.
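This threshold property is easy to check numerically (a minimal numpy sketch, not part of the original notes):

```python
import numpy as np

def sigmoid(z):
    """g(z) = 1 / (1 + e^(-z)), the logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

# g(0) = 0.5, so h(x) >= 0.5 exactly when z = theta^T x >= 0
assert sigmoid(0.0) == 0.5
assert sigmoid(4.6) > 0.99    # large positive z pushes h toward 1
assert sigmoid(-4.6) < 0.01   # large negative z pushes h toward 0
```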


[Ex 1]
$h_{\theta}(x) = P( y = 1 | x ; \theta)$, the estimated probability that y = 1 on input x. So, if we input patient A's tumor size into the sigmoid function and get
$h_{\theta}(x) = 0.7 $
then we can tell A there is a 70% chance that this tumor is malignant.

[Ex 2]
Assume our data looks like the plot below and our hypothesis is $h_{\theta}(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2)$,
[Plot 3]

Suppose logistic regression gives us the fitted parameters below; then we get a circular divide, as in the next plot.
$\theta = \lbrack  -1, 0, 0, 1, 1 \rbrack ^T$
This hypothesis predicts y = 1 when $\theta^T x = -1 + x_1^2 + x_2^2 \geq 0$, i.e. when $x_1^2 + x_2^2 \geq 1$.
[Plot 4]

This green circle is called the decision boundary; the blue x's mark y = 1 and the red o's mark y = 0.


3/8/17

Normal equation


[Foreword]
If the number of features is not too large, there is a better way to solve for the optimal parameters $\theta$: the normal equation.

[Ex]
Here we use the preceding example from the article Multiple variables again.

\begin{array}{llll}
\hfill\mathrm{Size~in~feet^2 (x_1)}\hfill &
\hfill\mathrm{\#~bedrooms(x_2)}\hfill &
\hfill\mathrm{\#~ floors(x_3)}\hfill &
\hfill\mathrm{Age(x_4)}\hfill &
\hfill\mathrm{Price~$1000~(y)}\hfill

\\ \hline
\\ 2104 & 5 & 1 & 45& 460
\\ 1416 & 3&2&40&232
\\ 1534 & 3&2&30&315
\\ 852 & 2&1&36&178
\\ \end{array}

m = 4
cost function :
$J(\theta_0, \theta_1, \cdots,\theta_4) = \frac{1}{2m} \sum_{i = 1}^{4} (h_{\theta}(x^{(i)})-y^{(i)})^2$
and the minimum occurs where every first-order partial derivative equals zero:
$\frac{\partial}{\partial \theta_j} J(\theta) = 0$
Here we create a new matrix X whose columns are [$\color{red}{x_0}, x_1,\cdots,x_4$]; remember to add the $x_0$ column, also called the intercept.

\[
X =
\begin{bmatrix}              
1  & 2104 & 5 & 1 & 45  \\
1 & 1416 & 3 & 2 & 40  \\
1 & 1534 & 3 & 2 & 30  \\
1 & 852 & 2 & 1 & 36 \\
\end{bmatrix}

y =
\begin{bmatrix}
460 \\
232 \\
315 \\
178 \\
\end{bmatrix}
\]

then the normal equation is :
$$\theta = (X^T X)^{-1} X^T y$$

We can get the optimal $\theta$ from the normal equation without any iterations, but it becomes expensive to compute when the number of features is large. The reason: gradient descent has time complexity $O(k n^2)$, while the normal equation is $O(n^3)$ (from inverting $X^T X$). The computation times are similar when n is small, but as n grows the normal equation's cost increases rapidly.
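As a sketch (numpy assumed), we can solve for $\theta$ on the table above. Note that with m = 4 examples and 5 parameters, $X^T X$ is singular here, so the pseudoinverse is used:

```python
import numpy as np

# design matrix from the table above, with the intercept column x0 = 1
X = np.array([[1, 2104, 5, 1, 45],
              [1, 1416, 3, 2, 40],
              [1, 1534, 3, 2, 30],
              [1,  852, 2, 1, 36]], dtype=float)
y = np.array([460.0, 232.0, 315.0, 178.0])

# theta = (X^T X)^(-1) X^T y; pinv is the safe choice since X^T X is singular
theta = np.linalg.pinv(X.T @ X) @ X.T @ y

# the system is underdetermined (m < n), but the pinv solution still
# reproduces the training targets exactly
assert np.allclose(X @ theta, y)
```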

If $(X^T X)$ is not invertible (a singular matrix), use "pinv" instead of "inv" in Octave, or "numpy.linalg.pinv" rather than "numpy.linalg.inv" in Python. Some common causes are:

  • Some features are too closely related (redundant features).
  • $n \geq m$
The fix is to delete one of each pair of correlated features, or to use regularization.


To summarize as the table below:
$$
\begin{array}{c|cc}
 & \textbf{gradient descent} & \textbf{normal equation}  \\
\hline \\
\textbf{choose }\alpha & \text{yes} & \text{no} \\
\textbf{need iterations} & \text{many} & \text{zero}\\
\textbf{if n is large} & \text{still works well} & \text{slow}
\end{array}
$$
The teacher's advice: he would consider using gradient descent when n > 100,000.


3/7/17

Features and polynomial regression


[Ex]
In the preceding house-price example, assume the model has only two features: the frontage and depth of the house. These two features are related, so they can be combined into one new feature equal to frontage $\times$ depth. This single feature contains the information of both, and sometimes yields a better model.

Here is the plot of size x (frontage $\times$ depth) against price (y)
[Plot 1 ]

If we use a quadratic model $y = \theta_0 + \theta_1 x + \theta_2 x^2$, its plot may look like this:
[Plot 2]
Although it fits well at first, this model doesn't make sense because the curve eventually turns downward. So we try a cubic model $y = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3$; its plot may look like this:
[Plot 3]
This is better than the quadratic model because it doesn't eventually decrease. When a model contains high-order features, feature scaling becomes even more important, since the ranges grow very rapidly ($x$, $x^2$, and $x^3$ span wildly different scales), and features on dissimilar scales cause the trouble we mentioned before in Feature scaling.
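A short numpy sketch (reusing the sizes from the earlier table) of why scaling matters for polynomial features:

```python
import numpy as np

# build scaled polynomial features x, x^2, x^3 from a single size feature
size = np.array([852.0, 1416.0, 1534.0, 2104.0])

features = np.column_stack([size, size ** 2, size ** 3])
# unscaled, the columns span wildly different ranges
# (~10^3, ~10^6, ~10^9), which makes gradient descent struggle
scaled = (features - features.mean(axis=0)) / features.std(axis=0)

# after scaling, every column has mean 0 and standard deviation 1
assert np.allclose(scaled.mean(axis=0), 0.0, atol=1e-9)
assert np.allclose(scaled.std(axis=0), 1.0)
```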

In the end, converting our existing features with a bit of math in this way can be quite helpful.





3/5/17

Gradient descent in practice - learning rate

[Foreword]
Suppose we have scaled our variables with mean normalization and are running gradient descent.
How do we confirm it is working correctly?

Recall the main job of gradient descent: minimizing the cost function J.
So if we plot J as gradient descent runs, we should see the curve decrease gradually when the learning rate is chosen correctly.

[Plot 1]

By around 300 iterations the curve has flattened out, so we can judge from this plot whether gradient descent has converged.

For some applications gradient descent needs a very large number of iterations to converge, so another check is the automatic convergence test: declare convergence when J decreases by less than some threshold $\epsilon$, maybe $10^{-3}$, in one iteration. The disadvantage of this method is that it is usually quite difficult to choose $\epsilon$.

If your curve is increasing, the most common reason is that the learning rate is too big; see [Plot 2] in the preceding article Batch gradient descent, or a plot like this:
[Plot 2]

This is usually easy to solve with a smaller learning rate: if the learning rate is small enough, the cost function should decrease on every iteration. On the other hand, if the learning rate is too small, convergence takes far too long, which isn't what we want either.

To wrap up: try several candidate learning rates, and choose one that is small enough for J to decrease on every iteration while converging as fast as possible.
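A small numpy experiment (toy data of my own) comparing a reasonable learning rate with one that is too large:

```python
import numpy as np

def cost(theta, X, y):
    return np.mean((X @ theta - y) ** 2) / 2

def run_gd(X, y, alpha, iters=200):
    """Run gradient descent and record J at every iteration."""
    theta = np.zeros(X.shape[1])
    history = []
    for _ in range(iters):
        theta -= alpha * (X.T @ (X @ theta - y)) / len(y)
        history.append(cost(theta, X, y))
    return np.array(history)

# toy data from y = 1 + x
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 2.0, 3.0])

good = run_gd(X, y, alpha=0.1)
too_big = run_gd(X, y, alpha=1.5)

assert np.all(np.diff(good) <= 0)   # J decreases on every iteration
assert too_big[-1] > good[-1]       # a too-large alpha diverges
```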

3/3/17

Gradient descent in practice - feature scaling

[Foreword]
When a problem has multiple features, we need to pay attention to the features' scales. The reason: if the features are on similar scales, gradient descent converges much more quickly than otherwise.

[Ex]
$x_1$ : size (0~2000 \(feet ^ 2\))
$x_2$ : number of bedroom (0~5)

From the preceding example, suppose you have these two variables $x_1$ and $x_2$ and plot the contours of the cost function.
[plot 1]

The contours may look like the very tall, skinny ellipses in [plot 1]; if you run gradient descent on this cost function, it may oscillate back and forth and take a long time to reach the global minimum.

So, as we mentioned before, we can scale these variables into the similar range.
$x_1 :=  \frac{x_1}{2000}$
$x_2 :=  \frac{x_2}{5}$
This puts both variables' ranges between 0 and 1.
[plot 2]

Then the contours are less skewed and look more like circles, as in [plot 2], and when you run gradient descent on this cost function again, the path to the global minimum is much more direct, saving iterations.

Generally, we want each variable in approximately the range -1 to +1, but the exact endpoints -1 and +1 are not important; the key point is getting the variables onto similar scales. If a range ends up as -2 to +0.5, that is fine.

[Mean normalization]

[Ex]
$x_1$ : size (0~2000 \(feet ^ 2\))
$x_2$ : number of bedroom (0~5)
$\mu_1$ : 1000
$\mu_2$ : 2

[Def]
$$\frac{x_i - \mu_i}{s_i}$$

Here $s_i$ is the range of $x_i$ (max minus min), though it can also be the standard deviation of $x_i$. In these notes the range is used because its formula is simpler than the standard deviation's, and the performance is not far off.

Finally, feature scaling doesn't have to be exact; it just has to be close enough for gradient descent to reach the global minimum faster.
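A minimal numpy sketch of mean normalization with the range as $s_i$ (the toy values follow the example above):

```python
import numpy as np

def mean_normalize(x):
    """(x - mean) / range, as in the definition above."""
    return (x - x.mean()) / (x.max() - x.min())

size = np.array([0.0, 500.0, 1000.0, 1500.0, 2000.0])
bedrooms = np.array([0.0, 1.0, 2.0, 4.0, 5.0])

# both features land on a similar scale, roughly within -0.5 .. 0.5
for feature in (mean_normalize(size), mean_normalize(bedrooms)):
    assert abs(feature.mean()) < 1e-9
    assert np.isclose(feature.max() - feature.min(), 1.0)
```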




2/27/17

Gradient descent for multiple variables

Continue from the preceding example

[Ex]
\begin{array}{llll}
\hfill\mathrm{Size~in~feet^2 (x_1)}\hfill &
\hfill\mathrm{\#~bedrooms(x_2)}\hfill &
\hfill\mathrm{\#~ floors(x_3)}\hfill &
\hfill\mathrm{Age(x_4)}\hfill &
\hfill\mathrm{Price~$1000~(y)}\hfill

\\ \hline
\\ 2104 & 5 & 1 & 45& 460
\\ 1416 & 3&2&40&232
\\ 1534 & 3&2&30&315
\\ 852 & 2&1&36&178
\\ ... & ...& ...& ...& ...
\\ \end{array}

Notation:

  • n = number of variables
  • m = number of examples
  • \(x^{(i)}\) = input variables of  \(i^{th}\) training example.
  • \(x^{(i)}_j\) = value of input variable j  in \(i^{th}\) training example.
Hypothesis: \( h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \theta_4 x_4\)

For convenience of notation, define \(x_0 = 1\) (that is, \( x^{(i)}_0 = 1\) for every example), so the hypothesis can be rewritten as:


$$h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \theta_4 x_4 = \theta^T x$$

So, the definition is as below:

[Def]
$$\begin{align*}& \text{repeat until convergence:} \; \lbrace \newline \; & \theta_j := \theta_j - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)} \; & \text{for j := 0...n}\newline \rbrace\end{align*}$$
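A vectorized numpy sketch of this loop (toy data of my own; replacing the whole $\theta$ vector at once is what makes the update simultaneous):

```python
import numpy as np

def gradient_descent(X, y, alpha, iters):
    """Repeat-until-convergence loop from the definition above; all
    theta_j are updated simultaneously because the whole vector is
    replaced in one assignment."""
    theta = np.zeros(X.shape[1])
    m = len(y)
    for _ in range(iters):
        theta = theta - alpha / m * (X.T @ (X @ theta - y))
    return theta

# toy data generated from y = 1 + 2*x1 + 3*x2 (x0 = 1 intercept column)
X = np.array([[1.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 1.0]])
y = np.array([1.0, 3.0, 4.0, 6.0])
theta = gradient_descent(X, y, alpha=0.3, iters=3000)
assert np.allclose(theta, [1.0, 2.0, 3.0], atol=1e-3)
```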

2/26/17

Batch gradient descent

[Ex]
Continuing from the preceding article about the cost function, here we will use gradient descent to minimize it, using just two parameters \( \theta_0 \) and \(\theta_1\). Gradient descent is a general algorithm, used well beyond linear regression.

outline:

  • start with some  \( \theta_0 \) and  \(\theta_1\) 
  • keep changing the  \( \theta_0 \) and  \(\theta_1\)  to reduce the \( J(\theta_0, \theta_1)\) and end up at a minimum

[Def]
repeat until converge {
\( \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)  \)        (for j = 0 and j = 1)
}

\( \alpha\) : the learning rate; it controls how big a step each descent update takes.
Each step of gradient descent uses ALL the training examples.

Notice: we need updating these parameters "simultaneously" !

[Ex]
Back to our example: we keep just one parameter \( \theta_1\) to make the algorithm simpler.
[Plot 1]


Assume this is our cost function. If \( \theta_1\) starts from the blue point, the slope there is positive, so the gradient-descent update decreases \( \theta_1\) and moves it toward the green point (the minimum); the speed of the movement depends on \(\alpha\). In contrast, if \( \theta_1\) starts from the red point, the slope is negative, so \( \theta_1\) increases and moves toward the green point as well.

The moving speed depends on \(\alpha\): a bigger \(\alpha\) converges faster, but if \(\alpha\) is too big, gradient descent diverges.
[Plot 2]
Conversely, if \(\alpha\) is too small, convergence takes far too long.
[Plot 3]

Another property of gradient descent: if the initial \( \theta_1\) is already at a local optimum, it stays unchanged, because the slope there is (nearly) zero.
[Plot 4]
So as we approach a local minimum, gradient descent automatically takes smaller steps; this is why we don't need to decrease \(\alpha\) over time.

[For linear regression]
[Def]
$$\frac{\partial}{\partial \theta_j}J(\theta_0,\theta_1) = \frac{\partial}{\partial \theta_j}\frac{1}{2m} \sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})^2$$

\( \begin{align*} \text{repeat until convergence: } \lbrace & \newline \theta_0 := & \theta_0 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m}(h_\theta(x_{i}) - y_{i}) \newline \theta_1 := & \theta_1 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m}\left((h_\theta(x_{i}) - y_{i}) x_{i}\right) \newline \rbrace& \end{align*}\)






2/19/17

Cost Function

[Ex]
Here are our training set and hypothesis:
\begin{array}{ll}
\hfill\mathrm{Size~in~feet^2 (x)}\hfill & \hfill\mathrm{Price($)~in~1000's(y)}\hfill
\\ \hline
\\ 2104 & 460
\\ 1416 & 232
\\ 1534 & 315
\\ 852 & 178
\\ ... & ...
\\ \end{array}

In this example we want to fit a straight line to predict house prices, so we choose a simple model.
Hypothesis: \( h_\theta(x) = \theta_0 + \theta_1 x\)

This is a simple regression problem: any choice of \( \theta_0\) and \( \theta_1\) gives an \( h_\theta(x)\), the value the model predicts for input x. For a good prediction model we want the difference between \( h_\theta(x) \) and y to be small; in other words,
does this hypothesis fit the data well?

We can measure the accuracy of our hypothesis with a cost function, also called the squared error function; it is almost the same as the MSE in statistics.

[Def]
$$J(\theta_0, \theta_1) = \dfrac {1}{2m} \displaystyle \sum _{i=1}^m \left ( \hat{y}_{i}- y_{i} \right)^2 = \dfrac {1}{2m} \displaystyle \sum _{i=1}^m \left (h_\theta (x_{i}) - y_{i} \right)^2$$
m: the number of training examples
\( h_\theta(x) \): form of hypothesis
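A tiny numpy sketch of this cost (toy data of my own):

```python
import numpy as np

def cost_J(theta0, theta1, x, y):
    """J(theta0, theta1) = (1/2m) * sum over i of (h(x_i) - y_i)^2"""
    h = theta0 + theta1 * x
    return np.sum((h - y) ** 2) / (2 * len(y))

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])

assert cost_J(0.0, 1.0, x, y) == 0.0   # the line y = x fits exactly
assert cost_J(0.0, 0.5, x, y) > 0.0    # any other line pays a cost
```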