Introduction
In previous posts we discussed the univariate linear regression model and how we can implement it in Python.
We have seen how we can fit a line, $\hat{y} = a_0 + a_1 x$, to a dataset of given points, and how linear regression techniques estimate the values of $a_0$ and $a_1$ by minimising a cost function. We have seen that the residual is the difference between the observed and the predicted value, that is, for any point $i$,
$$e_i = y_i - \hat{y}_i$$
We have looked at the Mean Squared Error, the sum of the squared residuals divided by the number of points; our objective is therefore to make this aggregate of the residuals as small as possible:
$$\underset{a_0,\, a_1}{\arg\min} \; \frac{\sum_{i=1}^{n} (y_i - a_0 - a_1 x_i)^2}{n}$$
We have also seen that when we differentiate the cost function with respect to $a_0$ and $a_1$ and equate to zero, we obtain
$$a_0 = \bar{y} - a_1 \bar{x}$$
and
$$a_1 = \frac{\sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y}}{\sum_{i=1}^{n} x_i^2 - n \bar{x}^2}$$
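As a quick refresher, these two closed-form expressions translate directly into NumPy. Below is a minimal sketch on made-up data (the arrays `x` and `y` are purely illustrative):

```python
import numpy as np

# Illustrative data: five points that lie roughly on a line.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

n = len(x)
x_bar, y_bar = x.mean(), y.mean()

# a1 = (sum(x_i y_i) - n x̄ ȳ) / (sum(x_i^2) - n x̄^2)
a1 = (np.sum(x * y) - n * x_bar * y_bar) / (np.sum(x ** 2) - n * x_bar ** 2)
# a0 = ȳ - a1 x̄
a0 = y_bar - a1 * x_bar

print(a0, a1)  # intercept and slope of the fitted line
```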
Multi-variable Case
Most real-world problems have multiple features, and therefore our approximation is a hyperplane, a linear combination of the $m$ features, expressed as
$$\hat{y} = a_0 + a_1 x_1 + a_2 x_2 + \cdots + a_m x_m$$
Hence if we define, for a dataset of $n$ points,
$$Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}$$
$$X = \begin{pmatrix} 1 & x_{11} & x_{12} & \dots & x_{1m} \\ 1 & x_{21} & x_{22} & \dots & x_{2m} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \dots & x_{nm} \end{pmatrix}$$
$$\beta = \begin{pmatrix} a_0 \\ a_1 \\ \vdots \\ a_m \end{pmatrix}$$
then,
$$\hat{Y} = X\beta$$
and the residuals
$$E = \begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{pmatrix} = \begin{pmatrix} y_1 - \hat{y}_1 \\ y_2 - \hat{y}_2 \\ \vdots \\ y_n - \hat{y}_n \end{pmatrix} = Y - \hat{Y}$$
We will now introduce the residual sum-of-squares (RSS) cost function, which is very similar to the mean squared error cost function but omits the division by $n$:
$$RSS = \sum_{i=1}^{n} e_i^2$$
We noticed in the previous case that dividing by the number of points has no effect on the result, since the factor is eliminated when we differentiate the cost function and equate to zero. We also notice that
$$RSS = E^T E = (Y - \hat{Y})^T (Y - \hat{Y}) = (Y - X\beta)^T (Y - X\beta) = Y^T Y - Y^T X \beta - \beta^T X^T Y + \beta^T X^T X \beta$$
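To make the matrix bookkeeping concrete, here is a small NumPy sketch (all numbers below are made up purely for illustration) that builds the design matrix, computes the residuals for an arbitrary candidate $\beta$, and confirms that the direct and expanded forms of the RSS agree:

```python
import numpy as np

# Illustrative data: n = 4 observations, m = 2 features.
features = np.array([[1.0, 2.0],
                     [2.0, 1.0],
                     [3.0, 4.0],
                     [4.0, 3.0]])
Y = np.array([[3.5], [4.0], [8.2], [9.1]])

# Design matrix: prepend a column of ones so that a0 acts as the intercept.
X = np.hstack([np.ones((features.shape[0], 1)), features])

beta = np.array([[0.5], [1.0], [1.0]])  # an arbitrary candidate beta

E = Y - X @ beta                         # residuals E = Y - X beta
rss_direct = (E.T @ E).item()            # RSS = E^T E

# Expanded form: Y^T Y - Y^T X b - b^T X^T Y + b^T X^T X b
rss_expanded = (Y.T @ Y - Y.T @ X @ beta - beta.T @ X.T @ Y
                + beta.T @ X.T @ X @ beta).item()

print(rss_direct, rss_expanded)  # equal up to floating-point rounding
```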
Matrix Differentiation
Before we continue, we will first remind ourselves of the following:
If we are given a vector $x$ of size $m \times 1$ and a matrix $A$ of conformable dimensions that does not depend on $x$, then:
for $y = A$: $\frac{dy}{dx} = 0$,
for $y = Ax$: $\frac{dy}{dx} = A$,
for $y = x^T A$: $\frac{dy}{dx} = A^T$,
for $y = x^T A x$: $\frac{dy}{dx} = 2x^T A$ (for symmetric $A$).
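These identities are easy to sanity-check numerically. The sketch below compares the quadratic-form rule against a central finite-difference approximation of the gradient, using a randomly generated symmetric $A$ (purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 4
A = rng.standard_normal((m, m))
A = A + A.T                       # symmetrise A so the 2 x^T A rule applies
x = rng.standard_normal(m)

def f(v):
    return v @ A @ v              # the quadratic form y = x^T A x

# Central finite differences approximate dy/dx one component at a time.
h = 1e-6
grad_fd = np.array([(f(x + h * e) - f(x - h * e)) / (2 * h)
                    for e in np.eye(m)])

grad_rule = 2 * x @ A             # the identity dy/dx = 2 x^T A
print(np.allclose(grad_fd, grad_rule, atol=1e-5))  # True
```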
Hence, differentiating the cost function with respect to $\beta$,
$$\frac{\partial RSS}{\partial \beta} = 0 - Y^T X - (X^T Y)^T + 2\beta^T X^T X = -Y^T X - Y^T X + 2\beta^T X^T X = -2Y^T X + 2\beta^T X^T X$$
For minimum RSS, $\frac{\partial RSS}{\partial \beta} = 0$, hence
$$2\beta^T X^T X = 2Y^T X$$
$$\beta^T X^T X = Y^T X$$
$$\beta^T = Y^T X (X^T X)^{-1}$$
Transposing both sides, and noting that $X^T X$ (and hence its inverse) is symmetric,
$$\beta = (X^T X)^{-1} X^T Y$$
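This closed-form solution is a few lines of NumPy. A minimal sketch, reusing the made-up `X` and `Y` from the previous snippet; note that solving the linear system $X^T X \beta = X^T Y$ is numerically preferable to forming the inverse explicitly:

```python
import numpy as np

# Same illustrative data as before.
features = np.array([[1.0, 2.0],
                     [2.0, 1.0],
                     [3.0, 4.0],
                     [4.0, 3.0]])
Y = np.array([[3.5], [4.0], [8.2], [9.1]])
X = np.hstack([np.ones((features.shape[0], 1)), features])

# beta = (X^T X)^{-1} X^T Y, computed via a linear solve.
beta = np.linalg.solve(X.T @ X, X.T @ Y)
print(beta.ravel())

# Cross-check against NumPy's built-in least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(beta_lstsq.ravel())  # identical up to floating-point rounding
```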
Two-variable case equations
For the scenario where we have only 2 features, so that $\hat{y} = a_0 + a_1 x_1 + a_2 x_2$, we can obtain the following equations for the parameters $a_0$, $a_1$ and $a_2$:
$$a_1 = \frac{\sum_{i=1}^{n} X_{2i}^2 \sum_{i=1}^{n} X_{1i} y_i - \sum_{i=1}^{n} X_{1i} X_{2i} \sum_{i=1}^{n} X_{2i} y_i}{\sum_{i=1}^{n} X_{1i}^2 \sum_{i=1}^{n} X_{2i}^2 - \left( \sum_{i=1}^{n} X_{1i} X_{2i} \right)^2}$$
$$a_2 = \frac{\sum_{i=1}^{n} X_{1i}^2 \sum_{i=1}^{n} X_{2i} y_i - \sum_{i=1}^{n} X_{1i} X_{2i} \sum_{i=1}^{n} X_{1i} y_i}{\sum_{i=1}^{n} X_{1i}^2 \sum_{i=1}^{n} X_{2i}^2 - \left( \sum_{i=1}^{n} X_{1i} X_{2i} \right)^2}$$
and
$$a_0 = \bar{y} - a_1 \bar{x}_1 - a_2 \bar{x}_2$$
where the capitalised quantities are the centred sums
$$\sum_{i=1}^{n} X_{1i}^2 = \sum_{i=1}^{n} x_{1i}^2 - \frac{\left( \sum_{i=1}^{n} x_{1i} \right)^2}{n}$$
$$\sum_{i=1}^{n} X_{2i}^2 = \sum_{i=1}^{n} x_{2i}^2 - \frac{\left( \sum_{i=1}^{n} x_{2i} \right)^2}{n}$$
$$\sum_{i=1}^{n} X_{1i} y_i = \sum_{i=1}^{n} x_{1i} y_i - \frac{\sum_{i=1}^{n} x_{1i} \sum_{i=1}^{n} y_i}{n}$$
$$\sum_{i=1}^{n} X_{2i} y_i = \sum_{i=1}^{n} x_{2i} y_i - \frac{\sum_{i=1}^{n} x_{2i} \sum_{i=1}^{n} y_i}{n}$$
$$\sum_{i=1}^{n} X_{1i} X_{2i} = \sum_{i=1}^{n} x_{1i} x_{2i} - \frac{\sum_{i=1}^{n} x_{1i} \sum_{i=1}^{n} x_{2i}}{n}$$
It is evident that finding the parameters becomes more difficult as we add more features.
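To see both routes agree, here is a sketch (using the same made-up numbers as in the matrix snippets above) that evaluates the centred sums and recovers the parameters; the result matches $(X^T X)^{-1} X^T Y$ up to floating-point rounding:

```python
import numpy as np

# Same illustrative data, as separate feature and target arrays.
x1 = np.array([1.0, 2.0, 3.0, 4.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0])
y = np.array([3.5, 4.0, 8.2, 9.1])
n = len(y)

# Centred sums, as defined above.
S11 = np.sum(x1 ** 2) - np.sum(x1) ** 2 / n
S22 = np.sum(x2 ** 2) - np.sum(x2) ** 2 / n
S12 = np.sum(x1 * x2) - np.sum(x1) * np.sum(x2) / n
S1y = np.sum(x1 * y) - np.sum(x1) * np.sum(y) / n
S2y = np.sum(x2 * y) - np.sum(x2) * np.sum(y) / n

den = S11 * S22 - S12 ** 2
a1 = (S22 * S1y - S12 * S2y) / den
a2 = (S11 * S2y - S12 * S1y) / den
a0 = y.mean() - a1 * x1.mean() - a2 * x2.mean()

print(a0, a1, a2)
```

Both routes give the same parameters; the advantage of the matrix form is that it does not change at all as more features are added.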