## Linear Regression Part 8 - Stochastic Gradient Descent

##### Mon January 31, 2022

In a previous post, we have discussed the gradient descent algorithm for linear regression applied to multiple features. As the size of the training dataset increases, gradient descent becomes very computationally expensive. In particular, if we consider the computation of the cost function,

$$J = \frac{1}{2m} \sum_{i=1}^m \left( \hat{y}^{(i)} - {y}^{(i)} \right)^2$$,

which is a calculation that we need to compute for each training data iteration; we can can see that the cost function is expensive as m increases.

Stochastic gradient descent is a variant of gradient descent where coefficients are updated after each observation, and it is therefore better suited for large datasets. Therefore if we consider an observation $i$ of the training dataset, we can calculate the cost of the observation, or how well our hypothesis function predicts the observation as,

$$Cost(\beta, (x^{(i)}, y^{(i)})) = \frac{1}{2} \left(\hat{y}^{(i)} - {y}^{(i)} \right)^2$$

We can therefore calculate the cost function as

$$J(\beta) = \frac{1}{m} \sum_{i=1}^m \left( \hat{y}^{(i)} - {y}^{(i)} \right)^2$$

Random Gradient Descent can be implemented as follows:

• Shuffle the training dataset, thus making sure that the order of the observations is random.

• For each observation:

• $$\begin{pmatrix} a_0 \\ a_1 \\ \vdots \\ a_n \end{pmatrix} := \begin{pmatrix} a_0 \\ a_1 \\ \vdots \\ a_n \end{pmatrix} - \alpha \begin{pmatrix} x_0^{(i)}\\ x_1^{(i)}\\ \vdots\\ x_n^{(i)} \end{pmatrix} \cdot \begin{pmatrix} \hat{y}^{(i)} - {y}^{(i)} \end{pmatrix}$$

Where $m$ is the number of training data points.

We can execute the above equation for a fixed number of iterations, where convergence is usually achieved. The code to implement this algorithm is as follows:


import os
import matplotlib
matplotlib.rcParams['text.usetex'] = True
import matplotlib.pyplot as plt
import pandas as pd
import sys
import numpy as np

def stochastic_gradient_descent(file, alpha, max_epochs = 5):
'''

Exit condition: max_epochs epochs
'''

np.random.seed(42)

full_filename = os.path.join(os.path.dirname(__file__), file)

# training_data = training_data.sample(frac=1).reset_index(drop=True)

# divide the data into features and labels
X = data_set.drop(['y'], axis=1).to_numpy()

# add a column of ones to the features matrix to account for the intercept, a0
X = np.insert(X, 0, 1, axis=1)

Y = data_set['y'].to_numpy()

# length of the training data
m = len(Y)

# initialize the y_hat vector to 0
y_hat = np.zeros(len(Y))

# beta will hold the values of the coefficients, hence it will be  the size
# of a row of the X matrix
# initialize beta to random values
beta = np.random.random(len(X[0]))

# initialize the number of epochs
epochs = 0

# loop until exit condition is met
while True:

i = np.random.randint(0, m)

# print(f'Minibatch: {i}')
x = X[i]
y = Y[i]

# calculate the hypothesis function for all training data
y_hat = np.dot(beta, x.T)

#  calculate the residuals
residuals = y_hat - y

# calculate the new value of beta
beta -= (alpha * residuals * x)

epochs += 1

# check if the cost function is close enough to 0, if so, break or if the number of
# iterations is greater than the threshold, break
if epochs > (m*max_epochs):
break

# calculate the cost for the training data and return the beta values and
# the number of iterations and the cost
y_hat = np.dot(beta, X.T)
residuals = y_hat - Y
cost = np.dot(residuals, residuals) / ( 2 * m)

return beta, cost

if __name__ == '__main__':

from timeit import default_timer as timer

file = 'data.csv'
alpha = 0.0001
max_epochs = 5
start = timer()
beta, cost = stochastic_gradient_descent(file, alpha, max_epochs)
end = timer()
print(f'Time: {end - start}, beta: {beta}, cost: {cost}')


### Exit Criteria Considerations

A critical consideration in the above algorithm is that the aspect of convergence measurement is wholly taken out, and the algorithm terminates after a fixed number of epochs. A better approach is to reserve a portion of the data for validation and then use the remaining data for training. The validation set is then used to calculate the cost function and the algorithm terminates when the cost function converges. An implementation of this approach is as follows:


import os
import matplotlib
matplotlib.rcParams['text.usetex'] = True
import pandas as pd
import sys
import numpy as np

def stochastic_gradient_descent(file, alpha=0.0023, epochs_threshold=100, costdifference_threshold=0.00001, plot=False):
'''

'''

np.random.seed(42)

full_filename = os.path.join(os.path.dirname(__file__), file)

# create train and test sets

# divide the data into features and labels
X_train = data_set.drop(['y'], axis=1).to_numpy()
# add a column of ones to the features matrix to account for the intercept, a0
X_train = np.insert(X_train, 0, 1, axis=1)
Y_train = data_set['y'].to_numpy()

X_validation = validation_data.drop(['y'], axis=1).to_numpy()
X_validation = np.insert(X_validation, 0, 1, axis=1)
Y_validation = validation_data['y'].to_numpy()

# length of the training data
m = len(Y_train)

# initialize the y_hat vector to 0
y_hat = np.zeros(len(Y_train))

# beta will hold the values of the coefficients, hence it will be  the size
# of a row of the X matrix
# initialize beta to random values
beta = np.random.random(len(X_train[0]))

# initialize the number of epochs
count = 0

previous_validation_cost = sys.float_info.max

# loop until exit condition is met
while True:

i = np.random.randint(0, m)

# print(f'Minibatch: {i}')
x = X_train[i]
y = Y_train[i]

# calculate the hypothesis function for all training data
y_hat = np.dot(beta, x.T)

#  calculate the residuals
residuals = y_hat - y

# calculate the new value of beta
beta -= (alpha * residuals * x)

count += 1

if count % 1000 == 0:
y_hat_validation = np.dot(beta, X_validation.T)
residuals_validation = y_hat_validation - Y_validation
cost_validation = np.dot(residuals_validation, residuals_validation) / ( 2 * len(Y_validation))

if abs(previous_validation_cost - cost_validation) < costdifference_threshold:
break
else:
previous_validation_cost = cost_validation

# uncomment this line to see details
# print(f'Epoch: {count/m} Cost: {cost_validation} beta: {beta}')

# check if the cost function is close enough to 0, if so, break or if the number of
# iterations is greater than the threshold, break
if (count/m) > (epochs_threshold):
break

# calculate the cost for the training data and return the beta values and
# the number of iterations and the cost
y_hat = np.dot(beta, X_train.T)
residuals = y_hat - Y_train
cost = np.dot(residuals, residuals) / ( 2 * m)

return beta, count, cost

if __name__ == '__main__':

from timeit import default_timer as timer

file = 'data.csv'
alpha = 0.00033
epochs_threshold = 100
costdifference_threshold = 0.005
plot = False

start = timer()
beta, count, cost = stochastic_gradient_descent(file, alpha, epochs_threshold, costdifference_threshold, plot)
end = timer()
print(f'Time: {end - start}, beta: {beta}, count: {count}, cost: {cost}')


### Conclusion

In this post, we have seen how the coefficients of the linear regression model can be optimized using the stochastic gradient descent algorithm. We have also seen different ways to determine exit criteria for the algorithm. We have also seen how to implement the algorithm in Python.

#### Paper Implementation - Uncertain rule-based fuzzy logic systems Introduction and new directions-Jerry M. Mendel; Prentice-Hall, PTR, Upper Saddle River, NJ, 2001,    555pp., ISBN 0-13-040969-3. Example 9-4, page 261

##### October 8, 2022
type2-fuzzy type2-fuzzy-library fuzzy python IT2FS paper-workout

#### Notes about Azure ML, Part 10 - An end-to-end AzureML example; Model Optimization

##### Creation and execution of an AzureML Model Optimization Experiment
machine-learning azure ml hyperparameter tuning model optimization

#### Notes about Azure ML, Part 9 - An end-to-end AzureML example Pipeline creation and execution

##### Creation and execution of a multi-step AzureML pipeline the selects the best model for a given dataset.
machine-learning azure ml pipeline