3/3/17

Gradient descent in practice - feature scaling

[Foreword]
When a problem has multiple features, we need to pay attention to the scale of those features. The reason is that if the features are on a similar scale, gradient descent will converge more quickly than if they are not.

[Ex]
$x_1$ : size (0~2000 $\text{feet}^2$)
$x_2$ : number of bedrooms (0~5)

From the preceding example, suppose you have these two variables $x_1$ and $x_2$ and you plot the contours of the cost function.
[plot 1]

Then your contours may look like the very tall and skinny ellipses in [plot 1], and if you run gradient descent on this cost function, it may oscillate back and forth and take a long time to reach the global minimum.

So, as we mentioned before, we can scale these variables into a similar range.
$x_1 :=  \frac{x_1}{2000}$
$x_2 :=  \frac{x_2}{5}$
This scaling puts both variables in the range 0 to 1.
[plot 2]

Then the contours will be less skewed and look more like circles, as in [plot 2], and when you run gradient descent on this cost function again, the path to the global minimum will be much more direct, saving iterations.

Generally, we want to get every variable into approximately the -1 ~ +1 range, but the exact numbers -1 and +1 are not important; the key point is getting the variables onto a similar scale, so if a range ends up as -2 ~ +0.5, that is OK.
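As a rough illustration (not from the lecture itself), here is a small NumPy sketch of this divide-by-range idea; the housing values are made up, and the divisors 2000 and 5 come from the example above.

[Code sketch]
```python
import numpy as np

# Hypothetical housing data: column 0 = size in feet^2, column 1 = number of bedrooms
X = np.array([[1500.0, 3.0],
              [800.0, 3.0],
              [2000.0, 4.0],
              [1400.0, 2.0]])

# Divide each feature by its assumed maximum (2000 for size, 5 for bedrooms),
# so both features end up roughly in the 0~1 range
X_scaled = X / np.array([2000.0, 5.0])
print(X_scaled)
```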

[Mean normalization]

[Ex]
$x_1$ : size (0~2000 $\text{feet}^2$)
$x_2$ : number of bedrooms (0~5)
$\mu_1$ : 1000
$\mu_2$ : 2

[Def]
$$x_i := \frac{x_i - \mu_i}{s_i}$$

Here $s_i$ is the range of $x_i$, though it can also be the standard deviation of $x_i$. In my explanation I use the range, since its formula is simpler than the standard deviation and the performance is not far off.
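Again as an illustrative sketch (with made-up data), mean normalization with the range as $s_i$ can be written like this; swapping in `X.std(axis=0)` for `s` gives the standard-deviation variant.

[Code sketch]
```python
import numpy as np

# Same hypothetical housing data: [size in feet^2, number of bedrooms]
X = np.array([[1500.0, 3.0],
              [800.0, 3.0],
              [2000.0, 4.0],
              [1400.0, 2.0]])

mu = X.mean(axis=0)                    # mu_i: mean of each feature
s = X.max(axis=0) - X.min(axis=0)      # s_i: range of each feature (X.std(axis=0) also works)
X_norm = (X - mu) / s                  # x_i := (x_i - mu_i) / s_i
print(X_norm)
```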

Finally, feature scaling doesn't have to be very exact; it only needs to be good enough to help gradient descent reach the global minimum faster.



