## Notes about Azure ML, Part 7 - The Concrete Strength Example; training a model

##### Wed March 16, 2022
machine-learning azure ml experiment

In the previous post of this series, we saw how to create and execute a machine learning experiment in AzureML. That experiment was not an actual ML experiment but a simple script that printed a message.

In this post, we will see how to create and execute an ML experiment that involves training a model. It is important to note that this post is only the first step towards a proper ML pipeline in AzureML and is therefore not the best of solutions. In future posts, we will continue improving this process and build a better ML pipeline in AzureML.

This post will also show how to pass parameters to the experiment, which is useful when we want to train a model with different hyperparameters. In this case, we pass the name of the Dataset to the experiment as a parameter.

## Situation

We will use the Concrete Compressive Strength Data Set from the UCI Machine Learning Repository. The data set contains the following features:

| Feature | Description | Units | Type |
| --- | --- | --- | --- |
| Cement (component 1) | quantitative | kg in a m3 mixture | Input Variable |
| Blast Furnace Slag (component 2) | quantitative | kg in a m3 mixture | Input Variable |
| Fly Ash (component 3) | quantitative | kg in a m3 mixture | Input Variable |
| Water (component 4) | quantitative | kg in a m3 mixture | Input Variable |
| Superplasticizer (component 5) | quantitative | kg in a m3 mixture | Input Variable |
| Coarse Aggregate (component 6) | quantitative | kg in a m3 mixture | Input Variable |
| Fine Aggregate (component 7) | quantitative | kg in a m3 mixture | Input Variable |
| Age | quantitative | Day (1~365) | Input Variable |
| Concrete compressive strength | quantitative | MPa | Output Variable |

The Dataset therefore has:

- 8 input variables (features)
- 1 output variable (label)
- only numerical data
- rather long feature and label names
- 1030 instances
- no missing values
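Once the csv is loaded into a pandas DataFrame, these properties are easy to verify. A minimal sketch, using two hand-made illustrative rows in place of the real file (the short column names match the renaming done later in the experiment script):

```python
import pandas as pd

# two hand-made rows standing in for the real csv (same nine columns)
df = pd.DataFrame({
    'cement': [540.0, 332.5], 'slag': [0.0, 142.5], 'ash': [0.0, 0.0],
    'water': [162.0, 228.0], 'superplastic': [2.5, 0.0],
    'coarseagg': [1040.0, 932.0], 'fineagg': [676.0, 594.0],
    'age': [28, 270], 'strength': [79.99, 40.27],
})

print(df.shape)                 # (rows, columns); the real Dataset is (1030, 9)
print(df.isnull().sum().sum())  # total missing values; 0 for this Dataset
print(df.dtypes)                # all columns are numerical
```

The same three checks run unchanged on the full DataFrame returned by `to_pandas_dataframe()` later in the post.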

There are many versions of this Dataset; to load it directly, we have selected a csv version:

```python
from azureml.core import Workspace, Dataset
import os
import azuremlproject.constants as constants


def setup_experiment():

    config_path = os.path.join(os.path.dirname(os.path.realpath(__file__)), '.azureml')
    ws = Workspace.from_config(path=config_path)

    concrete_data_path = (
        'https://raw.githubusercontent.com/stedy/'
        'Machine-Learning-with-R-datasets/master/concrete.csv')

    data_set = Dataset.Tabular.from_delimited_files(concrete_data_path)
    data_set.register(
        workspace=ws,
        name=constants.DATASET_NAME,
        description='concrete dataset',
        create_new_version=True)


if __name__ == '__main__':
    setup_experiment()
```


We create the dataset directly from a web file by providing the data URL to Dataset.Tabular.from_delimited_files. We then register the Dataset in the workspace so that we can use it in our experiment.

## The experiment

We will be using scikit-learn for this experiment. We will compare two regression algorithms to determine which one best predicts the concrete strength. The process employed is relatively straightforward:

- We use argparse to parse the command line arguments and retrieve the Dataset name.
- We use the train_test_split function from scikit-learn to split the data into training and test sets.
- We create a scikit-learn pipeline that contains the following steps:
  - RobustScaler: scales the data using statistics that are robust to outliers; it removes the median and scales the data according to the interquartile range.
  - LinearRegression: fits a linear regression model to the training data.
- We create another pipeline that contains the following steps:
  - RobustScaler: as above.
  - SVR: fits a support vector regression model to the training data.
- We use KFold to split the training data into ten folds.
- We use the cross_val_score function from scikit-learn to evaluate the performance of the two models.
- We print the results of the cross-validation.
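RobustScaler is worth a quick illustration: unlike a standard scaler, it centres each feature on the median and divides by the interquartile range, so a single outlier barely affects the scaling statistics. A minimal sketch on one feature:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# a single feature with one large outlier
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaler = RobustScaler()
X_scaled = scaler.fit_transform(X)

# the outlier does not distort the centring or the scale
print(scaler.center_)    # [3.] - the median
print(scaler.scale_)     # [2.] - the IQR (Q3=4 minus Q1=2)
print(X_scaled.ravel())  # [-1.  -0.5  0.   0.5 48.5]
```

With a mean/standard-deviation scaler, the value 100 would have pulled the centre far away from the bulk of the data; here only the outlier itself ends up with an extreme scaled value.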

The code for this experiment is below. We should remember that this is not yet the best way to create an ML pipeline in AzureML; in future posts, we will continue improving this process.

```python
'''
test and compare the predictive performance of various ML Models using the amount of explained variance
(in percentage) as an evaluation metric. ML Models used for comparison are:

- Linear Regression
- SVR
'''

import argparse
import pandas as pd
from sklearn.model_selection import train_test_split    # pyright: reportMissingImports=false
from sklearn.preprocessing import RobustScaler          # pyright: reportMissingImports=false
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from azureml.core.dataset import Dataset
from azureml.core.run import Run

run = Run.get_context()
ws = run.experiment.workspace

parser = argparse.ArgumentParser()
parser.add_argument(
    '--dataset_name',
    type=str,
    help='Name of the dataset to use')
args = parser.parse_args()

dataset = Dataset.get_by_name(ws, args.dataset_name)

df_orig = dataset.to_pandas_dataframe()

# the original column names are long; rename them to shorter ones
df_orig.columns = [
    'cement',
    'slag',
    'ash',
    'water',
    'superplastic',
    'coarseagg',
    'fineagg',
    'age',
    'strength']

# create features and labels datasets
X = df_orig.drop('strength', axis=1)
y = df_orig['strength']

print('X.shape:', X.shape)
print('y.shape:', y.shape)

# split into test and train sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# create a pipeline for each regressor
# each pipeline contains the regressor preceded by a scaler
# all pipelines are stored in a list
pipelines = []
pipelines.append((
    'Linear Regression',
    Pipeline([('scaler', RobustScaler()), ('LR', LinearRegression())])))
pipelines.append((
    'SupportVectorRegressor',
    Pipeline([('scaler', RobustScaler()), ('SVR', SVR())])))

# create an empty dataframe to store the results
result_train = pd.DataFrame({'Regressor': [], 'VarianceScore': [], 'StdDev': []})

# find and store the cross-validation score for each
# pipeline on the training data with raw features
for ind, val in enumerate(pipelines):
    name, pipeline = val
    kfold = KFold(n_splits=10, random_state=2020, shuffle=True)
    cv_results = cross_val_score(
        pipeline,
        X_train,
        y_train,
        cv=kfold,
        scoring='explained_variance')
    result_train.loc[ind] = [name, cv_results.mean() * 100, cv_results.std() * 100]

print(result_train)
```
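The `scoring='explained_variance'` metric used above can also be computed directly with scikit-learn's `explained_variance_score`, defined as 1 - Var(y_true - y_pred) / Var(y_true). A small hand-made example with illustrative values:

```python
from sklearn.metrics import explained_variance_score

# hand-made true values and predictions, purely for illustration
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

score = explained_variance_score(y_true, y_pred)
print(round(score * 100, 1))  # -> 95.7, the percentage form reported by the experiment
```

A score of 100 means the predictions explain all the variance in the targets; the experiment multiplies the raw score by 100 to report it as a percentage, together with its standard deviation across the ten folds.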


## Experiment Execution

To execute this experiment, we need to create an environment in AzureML that contains scikit-learn and pandas. We will use the Environment.from_pip_requirements function, discussed in the previous post, to create the environment. The requirements file is as follows:

```
scikit-learn
pandas
azureml-core
azureml-dataset-runtime
```


Finally, we create the code to execute the experiment. We discussed the structure of this script previously. What is different is that we pass the dataset name as an argument through the ScriptRunConfig instance. Its arguments parameter is a list containing the parameter name followed by its value.

```python
from azureml.core import Workspace, Experiment, ScriptRunConfig, Environment
import os
from azuremlproject import constants


def run_experiment():

    config_path = os.path.join(os.path.dirname(os.path.realpath(__file__)), '.azureml')
    ws = Workspace.from_config(path=config_path)

    # This is the configuration for the experiment. It tells the experiment what
    # code to run, where to run it, and what compute target to use.
    config = ScriptRunConfig(
        source_directory=os.path.join(os.path.dirname(os.path.realpath(__file__)), 'experiment_4'),
        script='experiment_4.py',
        compute_target=constants.INSTANCE_NAME,
        arguments=[
            '--dataset_name', constants.DATASET_NAME
        ])

    env = Environment.from_pip_requirements(
        name='env-4',
        file_path=os.path.join(os.path.dirname(os.path.realpath(__file__)), 'experiment_4_req.txt')
    )
    config.run_config.environment = env

    experiment = Experiment(ws, constants.EXPERIMENT_NAME)
    run = experiment.submit(config)
    print(run.get_portal_url())


if __name__ == '__main__':
    run_experiment()
```


Execution starts with the creation of the Dataset as explained above. We can verify that the Dataset was created by navigating to the Datasets section of the workspace, where we should see it listed. When clicking on the dataset name, we should see the following:

- Type: this is a tabular dataset
- Version: the version of the Dataset, which is 1 in this case
- Description: the description we provided when creating the Dataset
- URL: the URL of the Dataset

Of interest is the Explore page, where we can see the data in the Dataset.

The experiment execution gives us a lot of information, including:

- The compute target used for the experiment
- The arguments passed to the experiment
- The environment used for the experiment, with links to more information about the docker image used and the packages it contains

Once the experiment has executed, we can see the results in the logs. In this example, the logs show the explained-variance score and standard deviation of the two algorithms we used.

## Conclusion

Although simple, this example shows well how we can use Azure ML to train machine learning models. In the next post, we will look at using Azure ML to create pipelines to train machine learning models.
