## Linear Regression, Part 4 - The Multi-variable scenario

### Introduction

In previous posts we discussed the univariate linear regression model and how to implement it in Python.

We have seen how to fit a line, $\hat{y} = a_0 + a_1 x$, to a dataset of given points, and how linear regression estimates the values of $a_0$ and $a_1$ by minimizing a cost function. We have also seen that the residual is the difference between the observed value and the predicted value, that is, for any point $i$,

$$e_i = y_i - \hat{y}_i$$

We then looked at the mean squared error (MSE), the sum of the squared residuals divided by the number of points. Our objective is to choose $a_0$ and $a_1$ so that this aggregate of the residuals is as small as possible:

$$\underset{a_0, a_1}{\operatorname{argmin}} \frac{\sum_{i=1}^{n} (y_i-a_0-a_1 x_i)^{2} }{n}$$

We have seen that differentiating the cost function with respect to $a_0$ and $a_1$ and setting both derivatives to zero gives

$$a_0= \bar{y} - a_1 \bar{x}$$

and

$$a_1 =\frac{ \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}$$
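As a quick reminder of how these two formulas translate into code, here is a minimal sketch in plain Python. The function name `fit_line` and the sample data are illustrative assumptions, not part of the earlier posts; the data is chosen to lie exactly on $y = 1 + 2x$ so the result is easy to check by hand.

```python
def fit_line(xs, ys):
    """Return (a0, a1) for the least-squares line y ≈ a0 + a1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator and denominator of the slope formula for a_1.
    s_xy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
    s_xx = sum(x * x for x in xs) - n * x_bar ** 2
    a1 = s_xy / s_xx
    # The intercept follows from the means once the slope is known.
    a0 = y_bar - a1 * x_bar
    return a0, a1


# Hypothetical data lying exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
a0, a1 = fit_line(xs, ys)
print(a0, a1)
```

The sketch assumes the $x_i$ are not all identical; otherwise the denominator of the slope is zero and no unique line exists.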

### Multi-variable Case

Most real-world problems have multiple features, and therefore our approximation is a hyperplane, which is a linear combination of the features. For $m$ features it is expressed as

$$\hat{y} = a_0 + a_1 x_1 + a_2 x_2 + \dots + a_m x_m$$

Hence, for a dataset of $n$ observations, if we define

$$\textbf{Y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}$$

$$\textbf{X} = \begin{pmatrix} 1 & x_{11} & x_{12} & \dots & x_{1m} \\ 1 & x_{21} & x_{22} & \dots & x_{2m} \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ 1 & x_{n1} & x_{n2} & \dots & x_{nm} \end{pmatrix}$$

$$\beta = \begin{pmatrix} a_0 \\ a_1 \\ \vdots \\ a_m \end{pmatrix}$$

then,

$$\hat{\textbf{Y}} = \textbf{X} \beta$$

and the residuals are

$$\textbf{E} = \begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{pmatrix} = \begin{pmatrix} y_1 - \hat{y}_1 \\ y_2 - \hat{y}_2 \\ \vdots \\ y_n - \hat{y}_n \end{pmatrix} = \textbf{Y}-\hat{\textbf{Y}}$$
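The matrix definitions above can be sketched with NumPy as follows. The dataset and the parameter values are made up for illustration, and the parameters are deliberately not fitted; the point is only to show how the column of ones in $\textbf{X}$ carries the intercept $a_0$.

```python
import numpy as np

# Hypothetical dataset: n = 4 observations, m = 2 features.
features = np.array([[1.0, 2.0],
                     [2.0, 0.0],
                     [3.0, 1.0],
                     [4.0, 3.0]])
Y = np.array([9.5, 4.5, 10.0, 18.5])

# Prepend a column of ones so that a_0 acts as the intercept.
X = np.column_stack([np.ones(len(features)), features])

# Arbitrary (not fitted) parameters a_0, a_1, a_2.
beta = np.array([1.0, 2.0, 3.0])

Y_hat = X @ beta   # predictions, one per observation
E = Y - Y_hat      # residuals, one per observation

print(X.shape)
print(Y_hat)
print(E)
```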

We now introduce the residual sum-of-squares (RSS) cost function. It is very similar to the mean squared error, but without the division by the number of points:

$$RSS = \sum_{i=1}^{n} e_i^2$$

In the univariate case we noticed that the factor $\frac{1}{n}$, which turns the sum into a mean, disappears once we differentiate the cost function and set the derivative to zero. Minimizing the RSS therefore yields the same parameters as minimizing the mean squared error.

We also notice that

$$RSS = \textbf{E}^T \textbf{E}\\ = (\textbf{Y}-\hat{\textbf{Y}})^T(\textbf{Y}-\hat{\textbf{Y}})\\ = (\textbf{Y}- \textbf{X} \beta )^T (\textbf{Y}- \textbf{X} \beta )\\ = \textbf{Y}^T\textbf{Y}-\textbf{Y}^T\textbf{X} \beta - \beta^T\textbf{X}^T \textbf{Y} + \beta^T\textbf{X}^T\textbf{X} \beta$$
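A small numerical check can make the identity more concrete. The sketch below, on made-up data with arbitrary parameters, computes the RSS as an explicit sum of squares, as the inner product $\textbf{E}^T\textbf{E}$, and as the fully expanded quadratic form $\textbf{Y}^T\textbf{Y} - 2\textbf{Y}^T\textbf{X}\beta + \beta^T\textbf{X}^T\textbf{X}\beta$ (the two middle terms of the expansion are equal scalars, so they are merged here).

```python
import numpy as np

# Hypothetical data: 4 observations, intercept column plus 2 features.
X = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 0.0],
              [1.0, 3.0, 1.0],
              [1.0, 4.0, 3.0]])
Y = np.array([9.5, 4.5, 10.0, 18.5])
beta = np.array([1.0, 2.0, 3.0])  # arbitrary, not fitted

E = Y - X @ beta

rss_sum = np.sum(E ** 2)                  # sum of squared residuals
rss_inner = E @ E                         # E^T E
rss_expanded = (Y @ Y
                - 2 * Y @ X @ beta
                + beta @ X.T @ X @ beta)  # expanded quadratic form

print(rss_sum, rss_inner, rss_expanded)
```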

#### Matrix Differentiation

Before we continue, we will first remind ourselves of the following:

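Suppose $x$ is an $m \times 1$ vector and $A$ is a matrix that does not depend on $x$, with dimensions chosen so that each product below is defined. Then:

for $y=A$ $\rightarrow$ $\frac{dy}{dx}=0$,

for $y=Ax$ $\rightarrow$ $\frac{dy}{dx}=A$,

for $y=x^TA$ $\rightarrow$ $\frac{dy}{dx}=A^T$,

for $y=x^TAx$ with $A$ symmetric $\rightarrow$ $\frac{dy}{dx}=2x^TA$ (for a general square $A$ the result is $x^T(A + A^T)$).

The last rule applies to our cost function because $\textbf{X}^T\textbf{X}$ is always symmetric.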

Hence, differentiating the cost function with respect to $\beta$,

$$\frac{\partial RSS}{\partial\beta} = 0 -\textbf{Y}^T\textbf{X} - (\textbf{X}^T \textbf{Y})^T + 2 \beta^T\textbf{X}^T\textbf{X}\\ = -\textbf{Y}^T\textbf{X} - \textbf{Y}^T \textbf{X} + 2 \beta^T\textbf{X}^T\textbf{X}\\ = - 2 \textbf{Y}^T \textbf{X} + 2 \beta^T\textbf{X}^T\textbf{X}$$
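One way to gain confidence in this derivative is to compare it with a finite-difference approximation. The following is a minimal sketch on made-up data; the helper names `rss`, `rss_gradient` and `numerical_gradient` are illustrative assumptions.

```python
import numpy as np


def rss(beta, X, Y):
    """Residual sum of squares for the parameter vector beta."""
    E = Y - X @ beta
    return E @ E


def rss_gradient(beta, X, Y):
    """Analytic derivative: -2 Y^T X + 2 beta^T X^T X."""
    return -2 * Y @ X + 2 * beta @ X.T @ X


def numerical_gradient(beta, X, Y, h=1e-6):
    """Central-difference approximation, one coordinate at a time."""
    grad = np.zeros_like(beta)
    for j in range(len(beta)):
        step = np.zeros_like(beta)
        step[j] = h
        grad[j] = (rss(beta + step, X, Y) - rss(beta - step, X, Y)) / (2 * h)
    return grad


# Hypothetical data: intercept column plus 2 features.
X = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 0.0],
              [1.0, 3.0, 1.0],
              [1.0, 4.0, 3.0]])
Y = np.array([9.5, 4.5, 10.0, 18.5])
beta = np.array([0.5, 1.0, -1.0])  # arbitrary evaluation point

print(rss_gradient(beta, X, Y))
print(numerical_gradient(beta, X, Y))
```

The two printed vectors should agree up to floating-point rounding, since the RSS is quadratic in $\beta$ and central differences are exact for quadratics apart from rounding error.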

For a minimum of $RSS$ we require $\frac{\partial RSS}{\partial\beta} = 0$, hence

$$2 \beta^T\textbf{X}^T\textbf{X} = 2 \textbf{Y}^T \textbf{X}\\ \beta^T\textbf{X}^T\textbf{X} = \textbf{Y}^T \textbf{X}\\ \beta^T = \textbf{Y}^T \textbf{X}(\textbf{X}^T\textbf{X})^{-1}$$

and therefore, assuming $\textbf{X}^T\textbf{X}$ is invertible and using the fact that it is symmetric,

$$\beta = (\textbf{X}^T\textbf{X})^{-1} \textbf{X}^T \textbf{Y}$$
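The closed-form solution can be sketched in a few lines of NumPy. The data below is made up for illustration. Rather than forming the inverse explicitly, the sketch solves the equivalent linear system $\textbf{X}^T\textbf{X}\beta = \textbf{X}^T\textbf{Y}$ with `np.linalg.solve`, which is generally the more numerically stable choice, and compares the result with NumPy's own least-squares routine `np.linalg.lstsq`.

```python
import numpy as np

# Hypothetical dataset: 5 observations, 2 features.
features = np.array([[1.0, 2.0],
                     [2.0, 0.0],
                     [3.0, 1.0],
                     [4.0, 3.0],
                     [5.0, 2.0]])
Y = np.array([9.1, 4.8, 10.2, 17.9, 17.1])

# Design matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones(len(features)), features])

# Solve (X^T X) beta = X^T Y instead of inverting X^T X explicitly.
beta = np.linalg.solve(X.T @ X, X.T @ Y)

# Reference solution from NumPy's least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(beta)
print(beta_lstsq)
```

Both approaches assume the columns of $\textbf{X}$ are linearly independent; if they are not, $\textbf{X}^T\textbf{X}$ is singular and `np.linalg.solve` raises an error, while `np.linalg.lstsq` still returns a minimum-norm solution.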

### Two-variable case equations

For the scenario where we have only 2 features, so that $\hat{y} = a_0 + a_1 x_1 + a_2 x_2$, we can obtain the following equations for the parameters $a_0$, $a_1$ and $a_2$:

$$a_1 = \frac{ \sum_{i=1}^{n} X_{2i}^2 \sum_{i=1}^{n} X_{1i}y_i - \sum_{i=1}^{n} X_{1i}X_{2i} \sum_{i=1}^{n} X_{2i}y_i } {\sum_{i=1}^{n}X_{1i}^2 \sum_{i=1}^{n}X_{2i}^2 - (\sum_{i=1}^{n} X_{1i}X_{2i})^2}$$

$$a_2 = \frac{ \sum_{i=1}^{n} X_{1i}^2 \sum_{i=1}^{n} X_{2i}y_i - \sum_{i=1}^{n} X_{1i}X_{2i} \sum_{i=1}^{n} X_{1i}y_i } {\sum_{i=1}^{n}X_{1i}^2 \sum_{i=1}^{n}X_{2i}^2 - (\sum_{i=1}^{n} X_{1i}X_{2i})^2}$$

and

$$a_0 = \bar{y} - a_1 \bar{x}_1 - a_2 \bar{x}_2$$

where the capitalized sums are sums of deviations from the mean, computed from the raw data as

$$\sum_{i=1}^{n} X_{1i}^2 = \sum_{i=1}^{n} x_{1i}^2 - \frac{\left(\sum_{i=1}^{n} x_{1i}\right)^2}{n}$$

$$\sum_{i=1}^{n} X_{2i}^2 = \sum_{i=1}^{n} x_{2i}^2 - \frac{\left(\sum_{i=1}^{n} x_{2i}\right)^2}{n}$$

$$\sum_{i=1}^{n} X_{1i}y_{i} = \sum_{i=1}^{n} x_{1i} y_{i} - \frac{\sum_{i=1}^{n} x_{1i} \sum_{i=1}^{n} y_{i}}{n}$$

$$\sum_{i=1}^{n} X_{2i}y_{i} = \sum_{i=1}^{n} x_{2i} y_{i} - \frac{\sum_{i=1}^{n} x_{2i} \sum_{i=1}^{n} y_{i}}{n}$$

$$\sum_{i=1}^{n} X_{1i}X_{2i} = \sum_{i=1}^{n} x_{1i} x_{2i} - \frac{\sum_{i=1}^{n} x_{1i} \sum_{i=1}^{n} x_{2i}}{n}$$
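The two-feature equations can be sketched in plain Python as follows, reading the capitalized sums as sums of deviations from the mean. The function name `fit_plane` and the data are illustrative assumptions; the data lies exactly on $y = 1 + 2x_1 + 3x_2$, so the recovered parameters can be checked by hand.

```python
def fit_plane(x1, x2, y):
    """Return (a0, a1, a2) for the least-squares plane y ≈ a0 + a1*x1 + a2*x2."""
    n = len(y)
    sum_x1, sum_x2, sum_y = sum(x1), sum(x2), sum(y)

    # Sums of squares and cross-products of deviations from the mean.
    s11 = sum(a * a for a in x1) - sum_x1 ** 2 / n
    s22 = sum(b * b for b in x2) - sum_x2 ** 2 / n
    s12 = sum(a * b for a, b in zip(x1, x2)) - sum_x1 * sum_x2 / n
    s1y = sum(a * c for a, c in zip(x1, y)) - sum_x1 * sum_y / n
    s2y = sum(b * c for b, c in zip(x2, y)) - sum_x2 * sum_y / n

    # Common denominator; zero when x1 and x2 are perfectly collinear.
    det = s11 * s22 - s12 ** 2
    a1 = (s22 * s1y - s12 * s2y) / det
    a2 = (s11 * s2y - s12 * s1y) / det
    a0 = sum_y / n - a1 * sum_x1 / n - a2 * sum_x2 / n
    return a0, a1, a2


# Hypothetical data lying exactly on y = 1 + 2*x1 + 3*x2.
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [2.0, 0.0, 1.0, 3.0]
y = [9.0, 5.0, 10.0, 18.0]
a0, a1, a2 = fit_plane(x1, x2, y)
print(a0, a1, a2)
```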

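It is evident that writing out the parameters as scalar expressions becomes increasingly unwieldy as we add more features, whereas the matrix form $\beta = (\textbf{X}^T\textbf{X})^{-1} \textbf{X}^T \textbf{Y}$ applies unchanged to any number of features.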
