3/8/17

Normal equation


[Foreword]
If the number of features is not too large, there is a better way to solve for the optimal values of the parameters theta: the normal equation.

[Ex]
Here we reuse the example from the earlier post in this series, Multiple variables.

\begin{array}{ccccc}
\mathrm{Size~in~feet^2~(x_1)} &
\mathrm{\#~bedrooms~(x_2)} &
\mathrm{\#~floors~(x_3)} &
\mathrm{Age~(x_4)} &
\mathrm{Price~in~\$1000~(y)}
\\ \hline
\\ 2104 & 5 & 1 & 45 & 460
\\ 1416 & 3 & 2 & 40 & 232
\\ 1534 & 3 & 2 & 30 & 315
\\ 852 & 2 & 1 & 36 & 178
\\ \end{array}

m = 4
cost function:
$J(\theta_0, \theta_1, \cdots,\theta_4) = \frac{1}{2m} \sum_{i = 1}^{m} (h_{\theta}(x^{(i)})-y^{(i)})^2$
An extremum exists where all the first-order partial derivatives are equal to zero:
$\frac{\partial}{\partial \theta_j} J(\theta) = 0 \quad \text{for every } j$
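As a quick sketch (in NumPy, not part of the original post), the cost function can be computed directly on the four-house data from the table above:

```python
import numpy as np

# Design matrix for the four houses (first column is the intercept x0 = 1)
X = np.array([
    [1, 2104, 5, 1, 45],
    [1, 1416, 3, 2, 40],
    [1, 1534, 3, 2, 30],
    [1,  852, 2, 1, 36],
], dtype=float)
y = np.array([460, 232, 315, 178], dtype=float)

def cost(theta, X, y):
    """J(theta) = (1/2m) * sum((h_theta(x_i) - y_i)^2), with h_theta(x) = X @ theta."""
    m = len(y)
    residuals = X @ theta - y
    return residuals @ residuals / (2 * m)

# With theta = 0 the hypothesis predicts 0 for every house,
# so J reduces to the mean of y^2 divided by 2.
theta0 = np.zeros(5)
print(cost(theta0, X, y))
```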
Now we build the design matrix X, whose rows are [$\color{red}{x_0}, x_1,\cdots,x_4$]; remember to add the $x_0 = 1$ column, also called the intercept term.

\[
X =
\begin{bmatrix}
1 & 2104 & 5 & 1 & 45 \\
1 & 1416 & 3 & 2 & 40 \\
1 & 1534 & 3 & 2 & 30 \\
1 & 852 & 2 & 1 & 36 \\
\end{bmatrix}
,\qquad
y =
\begin{bmatrix}
460 \\
232 \\
315 \\
178 \\
\end{bmatrix}
\]

then the normal equation is:
$$\theta = (X^T X)^{-1} X^T y$$
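Applied to the four-house example, the normal equation $\theta = (X^T X)^{-1} X^T y$ can be sketched in NumPy (this code is an illustration, not part of the original post):

```python
import numpy as np

# Design matrix with the intercept column x0 = 1
X = np.array([
    [1, 2104, 5, 1, 45],
    [1, 1416, 3, 2, 40],
    [1, 1534, 3, 2, 30],
    [1,  852, 2, 1, 36],
], dtype=float)
y = np.array([460, 232, 315, 178], dtype=float)

# Normal equation: theta = (X^T X)^{-1} X^T y.
# pinv is used instead of inv because with 4 examples and 5 parameters
# X^T X is singular (n >= m, see below), and inv would fail.
theta = np.linalg.pinv(X.T @ X) @ X.T @ y

# With 5 parameters and only 4 houses the fit is exact,
# so X @ theta reproduces y (up to floating-point error).
print(np.round(X @ theta, 3))
```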

The normal equation gives the optimal $\theta$ without any iterations, but it becomes expensive when the number of features is large. The reason: the time complexity of gradient descent is $O(k n^2)$ while that of the normal equation is $O(n^3)$ (dominated by inverting $X^T X$), so the running times are similar when the number of features is small, but as n grows the cost of the normal equation increases rapidly.

If $(X^T X)$ is not invertible (a singular matrix), use "pinv" instead of "inv" in Octave, or "numpy.linalg.pinv" rather than "numpy.linalg.inv" in Python. Some common causes are:

  • Some features are too closely correlated (linearly dependent), which are called redundant features.
  • $n \geq m$ (more features than training examples)
The fix is to delete one of the two correlated features or to use regularization.
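A minimal sketch (not from the original post) of how a redundant feature makes $X^T X$ singular: here a hypothetical second feature is just the first one converted from feet² to m², so the columns of X are linearly dependent and "pinv" is needed.

```python
import numpy as np

size_ft2 = np.array([2104.0, 1416.0, 1534.0, 852.0])
size_m2 = size_ft2 * 0.0929  # redundant feature: an exact multiple of size_ft2
y = np.array([460.0, 232.0, 315.0, 178.0])

# Design matrix: intercept, size in feet^2, size in m^2
X = np.column_stack([np.ones(4), size_ft2, size_m2])

A = X.T @ X
# A is 3x3 but only rank 2, because two columns of X are proportional.
print(np.linalg.matrix_rank(A))

# np.linalg.inv(A) would raise LinAlgError or return numerically
# meaningless values; pinv still gives the minimum-norm least-squares solution.
theta = np.linalg.pinv(A) @ X.T @ y
```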


This is summarized in the table below:
$$
\begin{array}{c|cc}
 & \textbf{gradient descent} & \textbf{normal equation} \\
\hline
\textbf{choose } \alpha & \text{yes} & \text{no} \\
\textbf{need iterations} & \text{many} & \text{zero} \\
\textbf{if } n \textbf{ is large} & \text{still works well} & \text{slow}
\end{array}
$$
The teacher's advice is that he would consider using gradient descent when n > 100,000.

