Notes about Azure ML, Part 7 - The Concrete Strength Example; training a model

Wed March 16, 2022
machine-learning azure ml experiment

In the previous post of this series, we saw how to create and execute a machine learning experiment in AzureML. That experiment was not an actual ML experiment but a simple script that printed a message.

In this post, we will see how to create and execute an ML experiment that trains a model. Note that this is only the first step towards a proper ML pipeline in AzureML and is therefore not the best of solutions. In future posts, we will continue to improve this process and build a better ML pipeline in AzureML.

This post will also show how to pass parameters to the experiment, which is useful when we want to train a model with different hyperparameters. Here, the parameter we pass is the name of the Dataset.

Situation

We will consider the Concrete Compressive Strength Data Set from the UCI Machine Learning Repository. The data set contains the following features:

Feature                            Data type      Units                 Role
Cement (component 1)               quantitative   kg in a m3 mixture    Input Variable
Blast Furnace Slag (component 2)   quantitative   kg in a m3 mixture    Input Variable
Fly Ash (component 3)              quantitative   kg in a m3 mixture    Input Variable
Water (component 4)                quantitative   kg in a m3 mixture    Input Variable
Superplasticizer (component 5)     quantitative   kg in a m3 mixture    Input Variable
Coarse Aggregate (component 6)     quantitative   kg in a m3 mixture    Input Variable
Fine Aggregate (component 7)       quantitative   kg in a m3 mixture    Input Variable
Age                                quantitative   Day (1~365)           Input Variable
Concrete compressive strength      quantitative   MPa                   Output Variable

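The Dataset therefore has eight quantitative input variables and one quantitative output variable, the concrete compressive strength.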

There are many versions of this Dataset; to load it directly, we have selected a CSV version.

Loading the data into an AzureML dataset

from azureml.core import Workspace, Dataset
import os
import azuremlproject.constants as constants

def setup_experiment():

    config_path = os.path.join(os.path.dirname(os.path.realpath(__file__)), '.azureml')
    ws = Workspace.from_config(path=config_path)

    # URL of the CSV version of the dataset
    concrete_data_path = (
        'https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/concrete.csv')

    data_set = Dataset.Tabular.from_delimited_files(concrete_data_path)
    data_set.register(
        workspace=ws,
        name=constants.DATASET_NAME,
        description='concrete dataset',
        create_new_version=True)


if __name__ == '__main__':
    setup_experiment()

We create the dataset directly from a file on the web by passing its URL to Dataset.Tabular.from_delimited_files. We then register the Dataset in the workspace so that we can use it in our experiment.

The experiment

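We will be using scikit-learn for this experiment. We will compare two regression algorithms to determine which one better predicts concrete strength. The process employed is relatively straightforward: we load the registered Dataset, split it into training and test sets, wrap each regressor in a pipeline preceded by a scaler, and score each pipeline with 10-fold cross-validation on the training data.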

The code for this experiment is below. As noted earlier, this is not the best way to build an ML pipeline in AzureML; we will improve on it in future posts.

'''
test and compare the predictive performance of various ML Models using the amount of explained variance 
(in percentage) as an evaluation metric. ML Models used for comparison are:

- Linear Regression
- SVR
'''

import argparse
import pandas as pd
from sklearn.model_selection import train_test_split    # pyright: reportMissingImports=false
from sklearn.preprocessing import RobustScaler          # pyright: reportMissingImports=false
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from azureml.core.dataset import Dataset
from azureml.core.run import Run

run = Run.get_context()
ws = run.experiment.workspace

parser = argparse.ArgumentParser()
parser.add_argument(
    '--dataset_name',
    type=str,
    help='Name of the dataset to use'
)
args = parser.parse_args()

dataset = Dataset.get_by_name(ws, args.dataset_name)

df_orig = dataset.to_pandas_dataframe()

df_orig.columns = [
    'cement',
    'slag',
    'ash',
    'water',
    'superplastic',
    'coarseagg',
    'fineagg',
    'age',
    'strength']

# create features and labels datasets
X = df_orig.drop('strength',axis=1)
y = df_orig['strength']


print('X.shape:', X.shape)
print('y.shape:', y.shape)

# split into test and train sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# create a pipeline for each regressor
# each pipeline contains the regressor preceded by a scaler
# all pipelines are stored in a list
pipelines = []
pipelines.append((
    'Linear Regression',
    Pipeline([('scaler',RobustScaler()),('LR',LinearRegression())])))
pipelines.append((
    'SupportVectorRegressor',
    Pipeline([('scaler',RobustScaler()),('SVR',SVR())])))


# Create empty dataframe to store the results
result_train = pd.DataFrame({'Regressor':[],'VarianceScore':[],'StdDev':[]})

# Let's find and store the cross-validation score for each
# pipeline for training data with raw features.

for ind, val in enumerate(pipelines):
    name, pipeline = val
    kfold = KFold(n_splits=10,random_state=2020, shuffle=True)
    cv_results = cross_val_score(
        pipeline,
        X_train,
        y_train,
        cv=kfold,
        scoring='explained_variance')
    result_train.loc[ind] = [name,cv_results.mean()*100,cv_results.std()*100]


print(result_train)
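
The explained_variance scorer used above can be sketched in plain Python to show what the reported percentage measures. This is an illustrative re-implementation using only the standard library, not scikit-learn's own code; the function name and the sample values are made up for the example.

```python
from statistics import pvariance


def explained_variance(y_true, y_pred):
    """Return 1 - Var(residuals) / Var(y_true), using population variances."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    return 1.0 - pvariance(residuals) / pvariance(y_true)


# Made-up values, for illustration only (not taken from the concrete Dataset).
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
score = explained_variance(y_true, y_pred)

# Predictions that are off by a constant still score 1.0: the metric
# looks at the variance of the errors, not at their average size.
biased = explained_variance(y_true, [v + 5.0 for v in y_true])

print(f'{score * 100:.2f}%')
print(biased)
```

Because a constant offset does not lower the score, explained variance is best read alongside an error metric when the absolute accuracy of the predictions matters.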

Experiment Execution

To execute this experiment, we need to create an environment in AzureML that contains scikit-learn and pandas. As discussed in the previous post, we will use the Environment.from_pip_requirements function to create the environment. The requirements file is as follows:

scikit-learn
pandas
azureml-core
azureml-dataset-runtime

Finally, we create the code to execute the experiment. We discussed the structure of the script previously. What is different this time is that we pass the dataset name as an argument through the ScriptRunConfig instance. Its arguments parameter is a list in which each parameter name is followed by its value.

from azureml.core import Workspace, Experiment, ScriptRunConfig, Environment
import os
from azuremlproject import constants

def run_experiment():

    config_path = os.path.join(os.path.dirname(os.path.realpath(__file__)), '.azureml')
    ws = Workspace.from_config(path=config_path)

    # This is the configuration for the experiment. It tells the experiment what code to run, where to run
    # it, and what compute target to use.
    config = ScriptRunConfig(
        source_directory=os.path.join(os.path.dirname(os.path.realpath(__file__)), 'experiment_4'),
        script='experiment_4.py',
        compute_target=constants.INSTANCE_NAME,
        arguments=[
            '--dataset_name', constants.DATASET_NAME
        ])

    env = Environment.from_pip_requirements(
        name='env-4',
        file_path=os.path.join(os.path.dirname(os.path.realpath(__file__)), 'experiment_4_req.txt')
    )
    config.run_config.environment = env


    experiment = Experiment(ws, constants.EXPERIMENT_NAME)
    run = experiment.submit(config)
    aml_run = run.get_portal_url()
    print(aml_run)

if __name__ == '__main__':
    run_experiment()
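
How the arguments list reaches the training script can be sketched in isolation. The snippet below parses the same kind of list with argparse, in the way experiment_4.py does; the value 'concrete-dataset' is a placeholder, since the actual value of constants.DATASET_NAME is not shown in this post.

```python
import argparse

# Placeholder standing in for constants.DATASET_NAME
# (its real value is not shown in the post).
dataset_name = 'concrete-dataset'

# Same shape as the `arguments` list given to ScriptRunConfig:
# the parameter name followed by its value.
arguments = ['--dataset_name', dataset_name]

parser = argparse.ArgumentParser()
parser.add_argument('--dataset_name', type=str, help='Name of the dataset to use')

# In the real run the list arrives as command-line arguments;
# here it is handed to the parser directly.
args = parser.parse_args(arguments)
print(args.dataset_name)
```

Further hyperparameters could be passed the same way, by appending more name and value pairs to the list and adding a matching add_argument call in the script.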

Execution starts with creating the Dataset as explained above. We can verify that the Dataset was created by navigating to the datasets section of the workspace, where we should see the Dataset. Clicking on the dataset name opens its details.

Of interest is the Explore page, where we can see the data in the Dataset.

[Screenshot: concrete strength dataset]

The experiment execution also gives us a lot of information, which we can inspect through the portal URL printed by the script.

Once the experiment is executed, we can see the results in the logs. In this example, they show the mean explained-variance score and its standard deviation for each of the two algorithms we used.
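
The two numbers per regressor come from summarising ten per-fold scores. A rough standard-library sketch of the idea follows; kfold_indices is a hypothetical helper that only mimics what a shuffled KFold does (it will not reproduce scikit-learn's exact splits), and the fold scores are invented for illustration.

```python
import random
from statistics import mean, pstdev


def kfold_indices(n_samples, n_splits, seed):
    """Yield (train, validation) index lists; each sample is validated exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(kfold_indices(n_samples=23, n_splits=10, seed=2020))
validated = sorted(i for _, validation in folds for i in validation)

# Invented per-fold scores, only to show how the two table columns are derived.
fold_scores = [0.60, 0.58, 0.63, 0.61, 0.57, 0.62, 0.59, 0.64, 0.60, 0.56]
variance_score = mean(fold_scores) * 100   # 'VarianceScore' column
std_dev = pstdev(fold_scores) * 100        # 'StdDev' column (population form, like NumPy's default std())

print(len(folds), validated == list(range(23)))
print(round(variance_score, 2), round(std_dev, 2))
```

A small standard deviation relative to the mean suggests that the score does not depend much on which samples end up in which fold.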

Conclusion

Although simple, this example shows how we can use Azure ML to train machine learning models. In the next post, we will look at using Azure ML to create pipelines to train machine learning models.



