Regression from Scratch

Python ITB
Makers Institute, Jalan Kyai Gede Utama No.11, Dago

Tuesday, 17 April 2018


Outline

  • Closed Form Solution
  • Gradient Descent

Closed Form Solution

Simple linear regression, in a nutshell, is a very direct approach to predicting a response Y based on a single predictor X. It assumes that there is a linear relationship between X and Y. Mathematically, this is defined as

$$Y \approx b_0 + b_1X $$

Here, $b_0$ corresponds to the intercept while $b_1$ corresponds to the slope (gradient). Now, our goal is to estimate $b_0$ and $b_1$ so that the fitted line looks 'nice', i.e. follows the data as closely as possible.
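
To make this concrete, here is a minimal Python sketch of the model. The toy data, the function name `predict`, and the parameter guesses are not from the original material; they are made up purely for illustration.

```python
import numpy as np

# Toy data, made up purely for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

def predict(x, b0, b1):
    """Linear model estimate: y_hat = b0 + b1 * x."""
    return b0 + b1 * x

# An arbitrary first guess for the parameters -- not yet a good fit
print(predict(x, b0=0.0, b1=2.0))
```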

To measure how well a given line fits the data, we use the residual sum of squares, commonly referred to as RSS.

RSS is simply the sum of squared differences between the target $y$ and the estimated value $\hat{y}$

$$\sum(y-\hat{y})^2$$

Remember that $\hat{y}$ is equal to $b_0+b_1x$, so we can expand this further into $$\sum(y-(b_0+b_1x))^2$$
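
As a quick sketch, RSS can be computed directly in a few lines. The data and the parameter guesses below are, again, made-up values used only to show that different parameters give different RSS.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # same made-up data as before
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

def rss(x, y, b0, b1):
    """Residual sum of squares: sum of squared differences between y and y_hat."""
    y_hat = b0 + b1 * x
    return np.sum((y - y_hat) ** 2)

# Different parameter guesses give different RSS values; we want the smallest one
print(rss(x, y, b0=0.0, b1=2.0))
print(rss(x, y, b0=0.5, b1=1.9))
```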

Our goal is to find the 'correct' $b_0$ and $b_1$ so that RSS is minimized. The plot of RSS against $b_0$ and $b_1$ looks like a bowl-shaped surface.

The great thing about this RSS function is that it always has a global minimum. Calculus confirms this: we can set the derivative of the RSS function to zero and solve directly for $b_0$ and $b_1$.

However, a point where the derivative is zero could correspond to either a maximum or a minimum; on its own we don't know which. Computing the second derivative confirms that the RSS function is convex, which means that the stationary point is indeed its global minimum.

But RSS is a special case in which the partial derivatives can be set to zero and solved analytically for the 'correct' parameters. More often than not, we cannot do this, so take this closed-form solution with a grain of salt.

Now we differentiate RSS with respect to (w.r.t.) $b_0$ and $b_1$, set both derivatives to zero, and solve for the parameters $b_0$ and $b_1$.

Let's now refer to the RSS as a cost function or a loss function. In a nutshell, it's the function we strive to minimize (or maximize). In this specific case, we want to minimize the RSS. In other words, what we're looking for is the $b_0$ and $b_1$ that minimize the RSS.

Intuitively, we always look for the minimum of a cost function, but it turns out that in some cases, e.g. logistic regression, the objective (the log-likelihood) is maximized instead. It's a little off topic, but a bit of anticipation is always nice.

\begin{eqnarray} \frac{\partial C}{\partial b_0} &=& -2 \sum(y-b_0-b_1x)\nonumber\\ &=& -2 \sum(y-(b_0+b_1x))\nonumber\\ \end{eqnarray}

\begin{eqnarray} \frac{\partial C}{\partial b_1} &=& -2 \sum(y-b_0-b_1x)(x)\nonumber\\ &=& -2 \sum(y-(b_0+b_1x))(x)\nonumber\\ \end{eqnarray}
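
For completeness, differentiating once more gives the second derivatives behind the convexity claim made earlier:

\begin{eqnarray} \frac{\partial^2 C}{\partial b_0^2} &=& 2n\nonumber\\ \frac{\partial^2 C}{\partial b_1^2} &=& 2\sum x^2\nonumber\\ \frac{\partial^2 C}{\partial b_0 \partial b_1} &=& 2\sum x\nonumber\\ \end{eqnarray}

All of these are constants, and the Hessian determinant $4\left(n\sum x^2 - (\sum x)^2\right)$ is non-negative by the Cauchy-Schwarz inequality, so $C$ is indeed convex.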

Set the first equation to zero

\begin{eqnarray} 0 &=& -2 \sum(y-b_0-b_1x)\nonumber\\ 0 &=& \sum y - n b_0 - b_1\sum x\nonumber\\ b_0 &=& \frac{\sum y}{n} - \frac {b_1 \sum x}{n}\nonumber\\ &=& \overline{y} - b_1\overline{x}\nonumber\\ \end{eqnarray}

Set the second equation to zero

\begin{eqnarray} 0 &=& -2 \sum (y-b_0-b_1x)x\nonumber\\ 0 &=& \sum yx - b_0\sum x - b_1\sum x^2\nonumber\\ b_1\sum x^2 &=& \sum yx - b_0\sum x\nonumber\\ \end{eqnarray}

But recall that $b_0 = \overline{y} - b_1\overline{x}$

\begin{eqnarray} b_1\sum x^2 &=& \sum yx - (\overline{y} - b_1\overline{x})\sum x\nonumber\\ &=& \sum yx - (\frac {\sum y}{n} - b_1 \frac{\sum x}{n})\sum x \nonumber\\ &=& \sum yx - \frac{1}{n} \sum y \sum x + \frac{1}{n} b_1 \sum x \sum x\nonumber\\ b_1 (\sum x^2 - \frac {1}{n} \sum x \sum x) &=& \sum yx - \frac{1}{n} \sum y \sum x\nonumber\\ b_1 &=& \frac {\sum xy - \frac{1}{n} \sum x \sum y} {\sum x^2 - \frac {1}{n} \sum x \sum x}\nonumber\\ \end{eqnarray}
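
A minimal NumPy sketch of these two formulas, again with made-up data, might look like this; `np.polyfit` is used only as a sanity check, not as part of the derivation.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # made-up data for illustration
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

# Closed-form estimates, following the formulas derived above
b1 = (np.sum(x * y) - np.sum(x) * np.sum(y) / n) / (np.sum(x ** 2) - np.sum(x) ** 2 / n)
b0 = np.mean(y) - b1 * np.mean(x)
print(b0, b1)

# Sanity check: numpy's least-squares polynomial fit returns [slope, intercept]
print(np.polyfit(x, y, deg=1))
```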

However, most of the time we are not able to do this, since the cost function is usually much more complicated. That is why, in practice, we use Gradient Descent instead.

Gradient Descent

In the closed-form solution, we set the derivative of RSS w.r.t. each parameter to zero. In more technical terms, the vector of these derivatives is called the Jacobian (for a scalar cost it is simply the gradient), which we denote $\vec J$. However, most functions cannot be treated this way: they are so complicated that it is virtually impossible to find the global minimum or maximum by setting the derivatives to zero. So we're going to use a very powerful algorithm called Gradient Descent.

Let us formally define $\vec J$ as

\begin{equation*} \vec J = \begin{pmatrix} \frac{\partial C}{\partial b_0}\\ \frac{\partial C}{\partial b_1}\\ \end{pmatrix} \end{equation*}
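
In code, $\vec J$ for our RSS cost is just the two partial derivatives computed above, stacked into a vector. A small sketch follows; the function name `jacobian` and the data are assumptions made for illustration.

```python
import numpy as np

def jacobian(x, y, b0, b1):
    """Vector of partial derivatives of RSS w.r.t. b0 and b1."""
    residual = y - (b0 + b1 * x)
    return np.array([
        -2.0 * np.sum(residual),       # dC/db0
        -2.0 * np.sum(residual * x),   # dC/db1
    ])

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # made-up data for illustration
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
print(jacobian(x, y, b0=0.0, b1=0.0))
```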

Now the question is: how do we use $\vec J$ to reach the minimum?

Let's go back to the image of RSS vs $b_0$ and $b_1$.

However, that plot is too convoluted for our purpose. Suppose that we only have one parameter $w$ that decides the value of our cost function $C$. Then the plot reduces to a single curve.

The y-axis and x-axis denote the RSS value and $w$ respectively. Notice that the global minimum is located at the bottom of the curve. Our goal is to estimate the $w$ at which RSS is globally minimized.

The idea of gradient descent is that, instead of solving for $w$ directly, we iteratively adjust $w$ until RSS is minimized. We step down the hill until we reach the lowest point. The question then becomes: how do we know which direction leads toward the global minimum?

The answer lies in the plot. Notice that if we are standing on the left side of the minimum, where the gradient is negative, we should increase $w$. On the other hand, if we are on the right side, where the gradient is positive, we should go back, in other words decrease $w$.

This idea is expressed beautifully in the gradient descent update rule

\begin{eqnarray} w^{t+1} &=& w^t-\alpha \vec J\nonumber\\ &=& w^t-\alpha \frac{\partial C}{\partial w}\nonumber\\ \end{eqnarray}

As in our case, with $w = \begin{pmatrix} b_0\\ b_1 \end{pmatrix}$, \begin{eqnarray} w^{t+1} &=& w^t-\alpha \frac{\partial C}{\partial w}\nonumber\\ &=& w^t-\alpha \begin{pmatrix} \frac{\partial C}{\partial b_0}\\ \frac{\partial C}{\partial b_1} \end{pmatrix}\nonumber\\ &=& w^t-\alpha \begin{pmatrix} -2 \sum(y-(b_0+b_1x))\\ -2 \sum(y-(b_0+b_1x))\,x \end{pmatrix}\nonumber\\ \end{eqnarray}
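
Putting this together, a minimal gradient descent sketch for our two parameters could look like the following. The data, the starting point, the learning rate, and the iteration count are all assumptions chosen for illustration.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # made-up data for illustration
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

def gradient_descent(x, y, alpha=0.01, n_iter=2000):
    b0, b1 = 0.0, 0.0                          # arbitrary starting point
    for _ in range(n_iter):
        residual = y - (b0 + b1 * x)
        grad_b0 = -2.0 * np.sum(residual)      # dC/db0
        grad_b1 = -2.0 * np.sum(residual * x)  # dC/db1
        b0 -= alpha * grad_b0                  # step against the gradient
        b1 -= alpha * grad_b1
    return b0, b1

# Should land close to the closed-form estimates computed earlier
print(gradient_descent(x, y))
```

With a small enough learning rate, this iterative procedure converges to (approximately) the same $b_0$ and $b_1$ as the closed-form solution.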

$\alpha$ denotes the learning rate, i.e. how big a step we take each time we move. Later on we will show that if we set $\alpha$ too large, we may never reach the global minimum.
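
As a quick preview (reusing `gradient_descent`, `x`, and `y` from the sketch above; the value of $\alpha$ here is chosen only to demonstrate the effect), a step size that is too large makes the updates overshoot and grow instead of settling down:

```python
# With a step size this large, the parameters oscillate with growing magnitude
# instead of converging (the exact threshold depends on the data).
print(gradient_descent(x, y, alpha=0.1, n_iter=50))
```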


Thank you!