Linear Regression Part 8 - Stochastic Gradient Descent

Mon January 31, 2022

machine-learning linear-regression gradient-descent python

In a previous post, we have discussed the gradient descent algorithm for linear regression applied to multiple features. As the size of the training dataset increases, gradient descent becomes very computationally expensive. In particular, if we consider the computation of the cost function,

$$J = \frac{1}{2m} \sum_{i=1}^m \left( \hat{y}^{(i)} - {y}^{(i)} \right)^2$$,

which is a calculation that we need to compute for each training data iteration; we can can see that the cost function is expensive as m increases.

Stochastic gradient descent is a variant of gradient descent where coefficients are updated after each observation, and it is therefore better suited for large datasets. Therefore if we consider an observation $i$ of the training dataset, we can calculate the cost of the observation, or how well our hypothesis function predicts the observation as,

$$Cost(\beta, (x^{(i)}, y^{(i)})) = \frac{1}{2} \left(\hat{y}^{(i)} - {y}^{(i)} \right)^2$$

We can therefore calculate the cost function as

$$J(\beta) = \frac{1}{m} \sum_{i=1}^m \left( \hat{y}^{(i)} - {y}^{(i)} \right)^2$$

Random Gradient Descent can be implemented as follows:

Shuffle the training dataset, thus making sure that the order of the observations is random.
For each observation:
- $$\begin{pmatrix} a_0 \\
  a_1 \\
  \vdots \\
  a_n \end{pmatrix} := \begin{pmatrix} a_0 \\
  a_1 \\
  \vdots \\
  a_n \end{pmatrix} - \alpha \begin{pmatrix} x_0^{(i)}\\
  x_1^{(i)}\\
  \vdots\\
  x_n^{(i)} \end{pmatrix} \cdot \begin{pmatrix} \hat{y}^{(i)} - {y}^{(i)} \end{pmatrix} $$

Where $m$ is the number of training data points.

We can execute the above equation for a fixed number of iterations, where convergence is usually achieved. The code to implement this algorithm is as follows:


import os
import matplotlib
matplotlib.rcParams['text.usetex'] = True
import matplotlib.pyplot as plt
import pandas as pd
import sys
import numpy as np


def stochastic_gradient_descent(file, alpha, max_epochs = 5):
    '''
    Implementation of Stochastic Gradient Descent
    
    Exit condition: max_epochs epochs
    '''
    
    np.random.seed(42)

    # load the training data
    full_filename = os.path.join(os.path.dirname(__file__), file)
    data_set = pd.read_csv(full_filename, delimiter=',', header=0, index_col=False)

    # training_data = training_data.sample(frac=1).reset_index(drop=True)

    # divide the data into features and labels
    X = data_set.drop(['y'], axis=1).to_numpy()
    
    # add a column of ones to the features matrix to account for the intercept, a0
    X = np.insert(X, 0, 1, axis=1)

    Y = data_set['y'].to_numpy()
    
    # length of the training data
    m = len(Y)

    # initialize the y_hat vector to 0
    y_hat = np.zeros(len(Y))
    
    # beta will hold the values of the coefficients, hence it will be  the size 
    # of a row of the X matrix
    # initialize beta to random values
    beta = np.random.random(len(X[0]))

    # initialize the number of epochs
    epochs = 0

    # loop until exit condition is met
    while True:
        
        i = np.random.randint(0, m)

        # print(f'Minibatch: {i}')
        x = X[i]
        y = Y[i]

        # calculate the hypothesis function for all training data
        y_hat = np.dot(beta, x.T)

        #  calculate the residuals
        residuals = y_hat - y

        # calculate the new value of beta
        beta -= (alpha * residuals * x)

        epochs += 1
  
        # check if the cost function is close enough to 0, if so, break or if the number of 
        # iterations is greater than the threshold, break
        if epochs > (m*max_epochs):
            break
    
    # calculate the cost for the training data and return the beta values and 
    # the number of iterations and the cost
    y_hat = np.dot(beta, X.T)
    residuals = y_hat - Y
    cost = np.dot(residuals, residuals) / ( 2 * m)
    
    return beta, cost


if __name__ == '__main__':

    from timeit import default_timer as timer

    file = 'data.csv'
    alpha = 0.0001
    max_epochs = 5
    start = timer()
    beta, cost = stochastic_gradient_descent(file, alpha, max_epochs) 
    end = timer()
    print(f'Time: {end - start}, beta: {beta}, cost: {cost}')

Exit Criteria Considerations

A critical consideration in the above algorithm is that the aspect of convergence measurement is wholly taken out, and the algorithm terminates after a fixed number of epochs. A better approach is to reserve a portion of the data for validation and then use the remaining data for training. The validation set is then used to calculate the cost function and the algorithm terminates when the cost function converges. An implementation of this approach is as follows:


import os
import matplotlib
matplotlib.rcParams['text.usetex'] = True
import pandas as pd
import sys
import numpy as np


def stochastic_gradient_descent(file, alpha=0.0023, epochs_threshold=100, costdifference_threshold=0.00001, plot=False):
    '''
    Implementation of Stochastic Gradient Descent
    
    '''

    np.random.seed(42)
    
    # load the training data
    full_filename = os.path.join(os.path.dirname(__file__), file)
    data_set = pd.read_csv(full_filename, delimiter=',', header=0, index_col=False)

    # create train and test sets
    mask = np.random.rand(len(data_set)) < 0.8
    training_data = data_set[mask]
    validation_data = data_set[~mask]

    # divide the data into features and labels
    X_train = data_set.drop(['y'], axis=1).to_numpy()
    # add a column of ones to the features matrix to account for the intercept, a0
    X_train = np.insert(X_train, 0, 1, axis=1)
    Y_train = data_set['y'].to_numpy()

    X_validation = validation_data.drop(['y'], axis=1).to_numpy()
    X_validation = np.insert(X_validation, 0, 1, axis=1)
    Y_validation = validation_data['y'].to_numpy()


    # length of the training data
    m = len(Y_train)

    # initialize the y_hat vector to 0
    y_hat = np.zeros(len(Y_train))
    
    # beta will hold the values of the coefficients, hence it will be  the size 
    # of a row of the X matrix
    # initialize beta to random values
    beta = np.random.random(len(X_train[0]))

    # initialize the number of epochs
    count = 0

    previous_validation_cost = sys.float_info.max

    # loop until exit condition is met
    while True:
        
        i = np.random.randint(0, m)

        # print(f'Minibatch: {i}')
        x = X_train[i]
        y = Y_train[i]

        # calculate the hypothesis function for all training data
        y_hat = np.dot(beta, x.T)

        #  calculate the residuals
        residuals = y_hat - y

        # calculate the new value of beta
        beta -= (alpha * residuals * x)

        count += 1

        if count % 1000 == 0:
            y_hat_validation = np.dot(beta, X_validation.T)
            residuals_validation = y_hat_validation - Y_validation
            cost_validation = np.dot(residuals_validation, residuals_validation) / ( 2 * len(Y_validation))
            
            if abs(previous_validation_cost - cost_validation) < costdifference_threshold:
                break
            else:
                previous_validation_cost = cost_validation
            
            # uncomment this line to see details
            # print(f'Epoch: {count/m} Cost: {cost_validation} beta: {beta}')
                    
        # check if the cost function is close enough to 0, if so, break or if the number of 
        # iterations is greater than the threshold, break
        if (count/m) > (epochs_threshold):
            break
    
    # calculate the cost for the training data and return the beta values and 
    # the number of iterations and the cost
    y_hat = np.dot(beta, X_train.T)
    residuals = y_hat - Y_train
    cost = np.dot(residuals, residuals) / ( 2 * m)
    
    return beta, count, cost


if __name__ == '__main__':

    from timeit import default_timer as timer

    file = 'data.csv'
    alpha = 0.00033
    epochs_threshold = 100
    costdifference_threshold = 0.005
    plot = False


    start = timer()
    beta, count, cost = stochastic_gradient_descent(file, alpha, epochs_threshold, costdifference_threshold, plot)
    end = timer()
    print(f'Time: {end - start}, beta: {beta}, count: {count}, cost: {cost}')

Conclusion

In this post, we have seen how the coefficients of the linear regression model can be optimized using the stochastic gradient descent algorithm. We have also seen different ways to determine exit criteria for the algorithm. We have also seen how to implement the algorithm in Python.