Linear Regression Part 9 - Mini Batch Gradient Descent

Mon February 14, 2022
machine-learning linear-regression gradient-descent python

In the last post we compared the stochastic gradient descent algorithm to the batch gradient descent algorithm that we has discussed in a previous post. We discussed that as the size of the training dataset increases, batch gradient descent, where we use all the examples of the training set in each iteration, becomes very computationally expensive and that we can therefore use stochastic gradient descent, where we use one example of the training set in each iteration, to have a more efficient way to approach the coefficients of our hypothesis function.

In this post we will discuss mini-batch gradient descent, where we use a number $k$ of the training set examples in each iteration, which is a variation of the thoughts of stochastic gradient descent. We will discuss the general idea of mini-batch gradient descent and how to implement it in Python.

Therefore we can consider our training dataset as a collection of $m/k$ mini batches;

$$\textbf{X} = \begin{pmatrix} x_0^{(1)} &\dots & x_0^{(k)} & x_0^{(k+1)} & \dots & x_0^{(2k)} & \dots \dots & x_0^{(m)}\\
x_1^{(1)} &\dots & x_1^{(k)} & x_1^{(k+1)} & \dots & x_1^{(2k)} & \dots \dots & x_1^{(m)}\\
\vdots & & \vdots & \vdots & & \vdots & & \vdots \\
x_n^{(1)} &\dots & x_n^{(k)} & x_n^{(k+1)} & \dots & x_n^{(2k)} & \dots \dots & x_n^{(m)}\\
\end{pmatrix}$$ $$Dim:[n \times m]$$

Therefore the matrix $X$ can be represented by a matrix of mini batches,

$$\textbf{X} = \begin{pmatrix} X^{ \{ 1 \} } & X^{ \{ 2 \} } & \dots & X^{ \{ m/k \} }\\
\end{pmatrix}$$

where $X^{ \{ 1 \} }$ represents the first mini batch, $X^{ \{ 2 \} }$ represents the second mini batch, and so on, such that;

$$X^{ \{ 1 \} } = \begin{pmatrix} x_0^{(1)} &\dots & x_0^{(k)} \\
x_1^{(1)} &\dots & x_1^{(k)} \\
\vdots & & \vdots \\
x_n^{(1)} &\dots & x_n^{(k)} \\
\end{pmatrix}$$ $$Dim:[n \times k]$$

Similarly for the $Y$ vector,

$$\textbf{Y} = \begin{pmatrix} Y^{ \{ 1 \} } & Y^{ \{ 2 \} } & \dots & Y^{ \{ m/k \} }\\
\end{pmatrix}$$

Calculating the hypothesis function for a mini-batch of training data, we can write the following equation:

We can now calculate the hypothesis function for the first mini batch as a matrix multiplication:

$$\begin{pmatrix} \hat{y}^{(1)} \\
\hat{y}^{(2)} \\
\vdots \\
\hat{y}^{(k)} \end{pmatrix} = \begin{pmatrix} x_0^{(1)} & x_1^{(1)} & \dots & x_n^{(1)} \\
x_0^{(2)} & x_1^{(2)} & \dots & x_n^{(2)} \\
\vdots & \vdots & \vdots & \vdots \\
x_0^{(k)} & x_1^{(k)} & \dots & x_n^{(k)} \end{pmatrix} \begin{pmatrix} a_0 \\
a_1 \\
\vdots \\
a_n \end{pmatrix} $$ $$Dim:[k \times n] \cdot [n \times 1]$$

$$ \hat{Y}^{ \{ 1 \}} = \left( {X}^{ \{ 1 \}} \right)^T \beta$$

We can therefore update the coefficients of our hypothesis function from this mini-batch as follows:

$$\begin{pmatrix} a_0 \\
a_1 \\
\vdots \\
a_n \end{pmatrix} := \begin{pmatrix} a_0 \\
a_1 \\
\vdots \\
a_n \end{pmatrix} - \frac{\alpha}{k} \begin{pmatrix} x_0^{(1)} & \dots & x_0^{(k)}\\
x_1^{(1)} & \dots & x_1^{(k)}\\
\vdots&&\vdots\\
x_n^{(1)} & \dots & x_n^{(k)}\\
\end{pmatrix} \cdot \begin{pmatrix} \hat{y}^{(1)} - {y}^{(1)} \\
\hat{y}^{(2)} - {y}^{(2)} \\
\vdots \\
\hat{y}^{(k)} - {y}^{(k)} \end{pmatrix} $$ $$Dim:[n \times k] \cdot [k \times 1]$$

or

$$\beta := \beta -\frac{\alpha}{k} \textbf{X} \cdot \textbf{R}$$

The cost function for the mini-batch gradient descent algorithm is:

$$\textbf{J} = \frac{1}{k} \sum_{i=1}^{k} (\hat{y}^{(i)} - y^{(i)})^2 $$

We notice that this is very similar to batch gradient descent, but in this case we are considering only a batch of $k$ examples of the training set. In this case, however an epoch will update the coefficients of our hypothesis function $m/k$ times instead of only once as in the case of batch gradient descent.

The following code will implement mini-batch gradient descent.

import os
import matplotlib
matplotlib.rcParams['text.usetex'] = True
import matplotlib.pyplot as plt
import pandas as pd
import sys
import numpy as np

def minibatch_gradient_descent(file:str, alpha:float=0.0023, batch_size:int=100, epochs_threshold:int=100000, costdifference_threshold:float=0.00001, plot:bool=False):

    # load the training data
    full_filename = os.path.join(os.path.dirname(__file__), file)
    training_data = pd.read_csv(full_filename, delimiter=',', header=0, index_col=False)

    # training_data = training_data.sample(frac=1).reset_index(drop=True)

    # divide the data into features and labels
    X = training_data.drop(['y'], axis=1).to_numpy()
    # add a column of ones to the features matrix to account for the intercept, a0
    X = np.insert(X, 0, 1, axis=1)

    Y = training_data['y'].to_numpy()
    
    # length of the training data
    m = len(Y)
    print(f'Length of the training data: {m}')

    # initialize the y_hat vector to 0
    y_hat = np.zeros(len(Y))
    
    # beta will hold the values of the coefficients, hence it will be  the size 
    # of a row of the X matrix
    # initialize beta to random values
    beta = np.random.random(len(X[0]))

    # minibatches setting
    # number of minibatches = m => stochastic gradient descent
    # number of minibatches = 1 => batch gradient descent
    minibatch_size = int(m/batch_size)

    # initialize the number of epochs
    epoch_count = 0

    # initialize the previous cost function value to a large number
    # previous_cost = sys.float_info.max
    
    # store the cost function and a2 values for plotting
    costs = []
    a_2s = []
    
    previous_cumulative_cost = sys.float_info.max
    
    # loop until exit condition is met
    while True:

        cumulative_cost = 0

        for i in range(batch_size):

            # print(f'Minibatch: {i}')
            minibatch_X = X[i*minibatch_size:(i+1)*minibatch_size]
            minibatch_Y = Y[i*minibatch_size:(i+1)*minibatch_size]

            # calculate the hypothesis function for all training data
            y_hat = np.dot(beta, minibatch_X.T)
            #  calculate the residuals
            residuals = y_hat - minibatch_Y
            
            
            # calculate the new value of beta
            beta -= ( alpha / minibatch_size)  * np.dot(residuals, minibatch_X)

            # calculate the cost function
            cost = np.dot(residuals, residuals) / ( 2 * minibatch_size)

            cumulative_cost += cost

        # increase the number of iterations
        epoch_count += 1

        # record the cost and a1 values for plotting
        #     costs.append(cost)
        #     a_2s.append(__beta[2])

        cost_difference = previous_cumulative_cost - cumulative_cost
        # print(f'Epoch: {epochs}, average cost: {(cumulative_cost/minibatches_number):.3f}, beta: {beta}')
        previous_cumulative_cost = cumulative_cost

        # check if the cost function is diverging, if so, break
        # if cost_difference < 0:
        #     print(f'Cost function is diverging. Stopping training.')
        #     break
            
        # check if the cost function is close enough to 0, if so, break or if the number of 
        # iterations is greater than the threshold, break
        if abs(cost_difference) < costdifference_threshold or epoch_count > epochs_threshold:
            break

    # # plot the cost function and a1 values
    # plt.plot(a_2s[3:], costs[3:], '--bx', color='lightblue', mec='red')
    # plt.xlabel('a2')
    # plt.ylabel('cost')
    # plt.title(r'Cost Function vs. a1, with $\alpha$ =' + str(__alpha))
    # plt.show()
    
    # calculate the cost for the training data and return the beta values and 
    # the number of iterations and the cost
    y_hat = np.dot(beta, X.T)
    residuals = y_hat - Y
    cost = np.dot(residuals, residuals) / ( 2 * m)
    
    return beta, epoch_count, cost
    

if __name__ == '__main__':

    from timeit import default_timer as timer

    file = 'data.csv'
    alpha = 0.00023
    epochs_threshold = 1000
    costdifference_threshold = 0.00001
    plot = False
    batch_size = 100


    start = timer()
    beta, epoch_count, cost = minibatch_gradient_descent(file, alpha, batch_size, epochs_threshold, costdifference_threshold, plot)
    end = timer()
    print(f'Time: {end - start} beta: {beta}, epoch_count: {epoch_count}, cost: {cost}')



Paper Implementation - Uncertain rule-based fuzzy logic systems Introduction and new directions-Jerry M. Mendel; Prentice-Hall, PTR, Upper Saddle River, NJ, 2001,    555pp., ISBN 0-13-040969-3. Example 9-4, page 261

October 8, 2022
type2-fuzzy type2-fuzzy-library fuzzy python IT2FS paper-workout

Notes about Azure ML, Part 10 - An end-to-end AzureML example; Model Optimization

Creation and execution of an AzureML Model Optimization Experiment
machine-learning azure ml hyperparameter tuning model optimization

Notes about Azure ML, Part 9 - An end-to-end AzureML example Pipeline creation and execution

Creation and execution of a multi-step AzureML pipeline the selects the best model for a given dataset.
machine-learning azure ml pipeline
comments powered by Disqus


machine-learning 25 python 21 fuzzy 14 hugo_cms 11 azure-ml 10 linear-regression 10 gradient-descent 9 type2-fuzzy 8 type2-fuzzy-library 8 type1-fuzzy 5 cnc 4 dataset 4 datastore 4 it2fs 4 excel 3 paper-workout 3 r 3 c 2 c-sharp 2 experiment 2 iot 2 programming 2 robotics 2 weiszfeld_algorithm 2 arduino 1 automl 1 classifier 1 computation 1 cost-functions 1 development 1 embedded 1 fuzzy-logic 1 game 1 hyperparameter-tuning 1 javascript 1 learning 1 mathjax 1 maths 1 model-optimization 1 mxchip 1 pandas 1 pipeline 1 random_walk 1 roc 1 tools 1 vscode 1 wsl 1