3/16/17

Regularization

Continuing from the preceding example

Our high-order polynomial hypothesis is the same as the quadratic one except for the $\theta_3$ and $\theta_4$ terms. Our goal is to minimize the cost function $\dfrac {1}{2m} \displaystyle \sum _{i=1}^m \left (h_\theta (x_{i}) - y_{i} \right)^2$, so if we want to reduce the influence of $\theta_3$ and $\theta_4$, we can penalize them in the cost function by multiplying their squares by some large number:
$\dfrac {1}{2m} \displaystyle \sum _{i=1}^m \left (h_\theta (x_{i}) - y_{i} \right)^2 + 10000\,\theta_3^2 + 10000\,\theta_4^2$
Minimizing this cost then forces these parameters very close to 0, so the high-order polynomial behaves almost like the well-fitted quadratic function and is less prone to overfitting.


[In linear regression]

[Def]
$$min_\theta\ \dfrac{1}{2m}\ \left[ \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2 + \lambda\ \sum_{j=1}^n \theta_j^2 \right]$$

$\lambda$ is the regularization parameter; it decides how heavily the parameters are penalized, i.e. how strongly they are shrunk toward 0.
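
As a minimal sketch of this cost function (assuming a NumPy design matrix `X` whose first column is all ones, a target vector `y`, a parameter vector `theta`, and a regularization strength `lam`; these names are illustrative, not from the original post):

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """Regularized linear-regression cost.

    X: (m, n+1) design matrix with a leading column of ones
    y: (m,) targets, theta: (n+1,) parameters, lam: regularization strength
    """
    m = len(y)
    errors = X @ theta - y
    # theta[0] (the intercept) is excluded from the penalty term
    penalty = lam * np.sum(theta[1:] ** 2)
    return (np.sum(errors ** 2) + penalty) / (2 * m)
```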

[Gradient descent with regularization]
\begin{align*} & \text{Repeat}\ \lbrace \newline & \ \ \ \ \theta_0 := \theta_0 - \alpha\ \frac{1}{m}\ \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})x_0^{(i)} \newline & \ \ \ \ \theta_j := \theta_j - \alpha\ \left[ \left( \frac{1}{m}\ \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)} \right) + \frac{\lambda}{m}\theta_j \right] &\ \ \ \ \ \ \ \ \ \ j \in \lbrace 1,2...n\rbrace\newline & \rbrace \end{align*}

We can rearrange the update for $\theta_j$ to get more intuition about this equation.
$$\theta_j := \theta_j(1 - \alpha\frac{\lambda}{m}) - \alpha\frac{1}{m}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)}$$

This update separates into two parts. In the first part, since $\alpha\frac{\lambda}{m}$ is a small positive number, the factor $(1 - \alpha\frac{\lambda}{m})$ is slightly less than 1, which means every iteration shrinks $\theta_j$ a little bit; the second part is the same as the un-regularized update.
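
A sketch of the corresponding gradient descent loop, under the same assumed shapes for `X`, `y`, and `theta` (the step size `alpha` and iteration count are placeholders):

```python
import numpy as np

def gradient_descent_regularized(X, y, theta, alpha, lam, num_iters):
    """Batch gradient descent for regularized linear regression."""
    m = len(y)
    for _ in range(num_iters):
        errors = X @ theta - y                 # (m,) prediction errors
        grad = (X.T @ errors) / m              # un-regularized gradient
        grad[1:] += (lam / m) * theta[1:]      # add (lambda/m) * theta_j for j >= 1
        theta = theta - alpha * grad           # simultaneous update of all theta_j
    return theta
```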

[Normal equation with regularization]
\begin{align*}& \theta = \left( X^TX + \lambda \cdot L \right)^{-1} X^Ty \newline& \text{where}\ \ L = \begin{bmatrix} 0 & & & & \newline & 1 & & & \newline & & 1 & & \newline & & & \ddots & \newline & & & & 1 \newline\end{bmatrix}\end{align*}

Note that the upper-left entry of $L$ is zero, since the intercept $\theta_0$ doesn't need regularization. The other advantage is that in the $m < n$ case the matrix $X^TX + \lambda \cdot L$ is still invertible; as mentioned in the earlier normal-equation post, $X^TX$ alone is non-invertible (singular) in the $m < n$ case when the regularization term is not added.
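
A sketch of the regularized normal equation under the same assumptions; `np.linalg.solve` is used instead of an explicit matrix inverse for numerical stability:

```python
import numpy as np

def normal_equation_regularized(X, y, lam):
    """Closed-form solution theta = (X^T X + lambda * L)^{-1} X^T y."""
    n_plus_1 = X.shape[1]
    L = np.eye(n_plus_1)
    L[0, 0] = 0                      # do not regularize the intercept
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)
```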


[In logistic regression]

[Def]
$$J(\theta) = - \frac{1}{m} \sum_{i=1}^m \large[ y^{(i)}\ \log (h_\theta (x^{(i)})) + (1 - y^{(i)})\ \log (1 - h_\theta(x^{(i)}))\large] + \frac{\lambda}{2m}\sum_{j=1}^n \theta_j^2$$

[Gradient descent with regularization]
The gradient descent update has the same form as in linear regression, but the hypothesis $h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$ is different.
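
A sketch of the regularized logistic-regression cost and gradient, assuming the same kind of `X`, `theta`, and a label vector `y` of 0/1 values (again, illustrative names):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost_grad(theta, X, y, lam):
    """Regularized logistic-regression cost and gradient."""
    m = len(y)
    h = sigmoid(X @ theta)
    cost = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
    cost += (lam / (2 * m)) * np.sum(theta[1:] ** 2)
    grad = (X.T @ (h - y)) / m
    grad[1:] += (lam / m) * theta[1:]   # same form as in linear regression
    return cost, grad
```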
