Notes about Azure ML, Part 8 - An end-to-end AzureML example; Workspace creation and data upload

Thu June 2, 2022
machine-learning azure ml dataset datastore


In the previous posts in this series, we examined several features of Azure Machine Learning, executed some experiments and looked at their results. We will now look at a more complete, end-to-end machine learning project in Azure that leverages other Azure ML features, such as training pipelines, hyperparameter tuning and model deployment.

A disclaimer

This collection of posts is not intended to be a complete guide to Azure Machine Learning model building and deployment, or a guide to building machine learning pipelines. It is designed to be a starting point for those interested in using Azure Machine Learning and to introduce some of the exciting features available on the platform.

Structure of these posts

This project has been divided into a number of smaller posts to limit the length and content of each post. It consists of the following entries:

  1. An introduction to the project, considerations and data uploading to Azure
  2. Training pipeline creation and execution
  3. Hyperparameter tuning
  4. Model Testing
  5. Model Deployment

AzureML development considerations and project structure

I was inspired by a post on Stack Overflow that discussed methods to eliminate sibling import problems, and I used it to create a consistent project structure for my Python projects, which makes all my development projects easier to work with. I have also completed a template for the project structure creation process, found in this repository. I have named the project azuremlproject for this particular exercise.

At a high level, the project structure is as follows:

azuremlproject                This folder contains all the project files
├── .azureml                  This folder contains the Azure Machine Learning config files
├── data                      This folder contains the data that will be used in the experiments
├── docs                      This folder contains the documentation for the project
├── experiments               This folder contains the ML experiments code
│   ├── experiment_1          This folder contains experiment 1
│   ├── ...
│   └── experiment_14         This folder contains experiment 14
│       ├── deploy            This folder contains the deployment of the model
│       ├── optimize          This folder contains the hyperparameter optimization
│       ├── train             This folder contains the training of the model
│       ├── upload            This folder contains the upload of the data required for this experiment
│       └── validate          This folder contains the validation of the model
└── workspace_creation        This folder contains the scripts to create the workspace

As discussed in the previous posts on this site, I have created several scripts that instantiate the Azure resource group, AzureML Workspace with its Compute resources, and an Azure ML Datastore with the name of the project. One advantage of this approach is that it is easier to delete the resource group and recreate it whenever necessary, thus saving some money. The workspace creation scripts are driven from a parameters file containing the names of the various entities that the user will create, derived from the name of the project. For example, in this project, the project name is set to azmlprj14 and therefore the resource group name is rgazmlprj14, the workspace name is wsazmlprj14 and so on.
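The naming convention described above can be sketched as a small helper. This is a minimal illustration and not the actual workspace creation script: the function name `derive_resource_names` and the dictionary keys are hypothetical, and only the `rg`, `ws` and `ds` prefixes are taken from the names mentioned in this post.

```python
def derive_resource_names(project_name: str) -> dict:
    """Derive Azure resource names from a single project name.

    The prefixes follow the names used in this post (rgazmlprj14,
    wsazmlprj14, dsazmlprj14); the helper itself is hypothetical.
    """
    return {
        "resource_group": f"rg{project_name}",
        "workspace": f"ws{project_name}",
        "datastore": f"ds{project_name}",
    }


names = derive_resource_names("azmlprj14")
print(names["resource_group"])
print(names["workspace"])
```

Keeping every name a function of the project name means that deleting and recreating the resource group only requires the one value in the parameters file.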

Two compute resources are created in the workspace:

The AzureML workspace configuration is then stored in the file .azureml/config.json.
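For reference, a workspace configuration file of this kind is a small JSON document. The sketch below writes and reads one using only the standard library. The three keys are the ones the Azure ML SDK reads from config.json; the subscription id is a placeholder and the helper name `write_workspace_config` is hypothetical.

```python
import json
import os
import tempfile


def write_workspace_config(folder, subscription_id, resource_group, workspace_name):
    """Write an .azureml/config.json file under `folder` and return its path."""
    config_dir = os.path.join(folder, ".azureml")
    os.makedirs(config_dir, exist_ok=True)
    config_file = os.path.join(config_dir, "config.json")
    with open(config_file, "w", encoding="utf-8") as handle:
        json.dump(
            {
                "subscription_id": subscription_id,
                "resource_group": resource_group,
                "workspace_name": workspace_name,
            },
            handle,
            indent=2,
        )
    return config_file


# Placeholder values; a real subscription id is a GUID from the Azure portal.
with tempfile.TemporaryDirectory() as project_root:
    path = write_workspace_config(
        project_root, "<subscription-id>", "rgazmlprj14", "wsazmlprj14")
    with open(path, encoding="utf-8") as handle:
        loaded = json.load(handle)

print(loaded["workspace_name"])
```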

The Project

In this exercise, we will use the UCI Concrete Compressive Strength Dataset, which is both straightforward and easy to use. It is also a small dataset, and therefore we can manipulate and explore the data quickly.

The data set contains the following features:

Feature                             Data type       Units                 Role
Cement (component 1)                quantitative    kg in a m3 mixture    Input Variable
Blast Furnace Slag (component 2)    quantitative    kg in a m3 mixture    Input Variable
Fly Ash (component 3)               quantitative    kg in a m3 mixture    Input Variable
Water (component 4)                 quantitative    kg in a m3 mixture    Input Variable
Superplasticizer (component 5)      quantitative    kg in a m3 mixture    Input Variable
Coarse Aggregate (component 6)      quantitative    kg in a m3 mixture    Input Variable
Fine Aggregate (component 7)        quantitative    kg in a m3 mixture    Input Variable
Age                                 quantitative    Day (1~365)           Input Variable
Concrete compressive strength       quantitative    MPa                   Output Variable

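To summarize, the data set contains eight quantitative input variables (seven mixture components and the age of the concrete) and one quantitative output variable, the concrete compressive strength.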

There are many versions of this dataset; to load it directly, we have selected a CSV version.
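Before uploading, it can be useful to check that the CSV file has the expected shape. The sketch below does this with the standard library only. The column names are assumptions (CSV versions of this dataset use different headers), the function name is hypothetical, and the single data row holds made-up illustrative values, not values taken from the dataset.

```python
import csv
import io

# Assumed header names; adjust them to match the CSV version in use.
INPUT_COLUMNS = [
    "cement", "slag", "fly_ash", "water",
    "superplasticizer", "coarse_aggregate", "fine_aggregate", "age",
]
TARGET_COLUMN = "strength"


def split_features_and_target(csv_text):
    """Parse CSV text and return (inputs, targets) as lists of floats."""
    reader = csv.DictReader(io.StringIO(csv_text))
    expected = INPUT_COLUMNS + [TARGET_COLUMN]
    missing = [name for name in expected if name not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    inputs, targets = [], []
    for row in reader:
        inputs.append([float(row[name]) for name in INPUT_COLUMNS])
        targets.append(float(row[TARGET_COLUMN]))
    return inputs, targets


# Made-up row for illustration only.
sample = (
    "cement,slag,fly_ash,water,superplasticizer,"
    "coarse_aggregate,fine_aggregate,age,strength\n"
    "300.0,0.0,0.0,180.0,0.0,1000.0,800.0,28,30.0\n"
)
features, target = split_features_and_target(sample)
print(len(features[0]), target)
```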

Data Uploading

The first task of this exercise is to upload the data to Azure ML Datastore and make it available for the experiments through an Azure Machine Learning Dataset. There are many ways to create AzureML Datasets; we will upload a CSV file in this example.

The script required for this task is located under the upload folder.


The following script uploads the data to the AzureML datastore and registers it as an AzureML Dataset:

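"""Creates concrete dataset to be used as input for the experiment."""
import os

from azureml.core import Workspace, Dataset, Datastore
from azureml.data.datapath import DataPath

from azuremlproject.constants import DATASTORE_NAME

# Name of the dataset.
DATASET_NAME = 'concrete_baseline'

# Load the workspace configuration from the .azureml folder.
# NOTE: the arguments of this path were lost from the listing; the location
# below is an assumption (script three levels below the project root).
project_root = os.path.abspath(
    os.path.join(os.path.dirname(__file__), '..', '..', '..'))
config_path = os.path.join(project_root, '.azureml', 'config.json')
w_space = Workspace.from_config(path=config_path)

if DATASET_NAME not in Dataset.get_all(w_space).keys():
    data_store = Datastore.get(w_space, DATASTORE_NAME)
    # upload all the files in the concrete data folder to the
    # concrete_data_baseline folder of the project datastore
    # NOTE: this call was missing from the listing; the call and the
    # local source folder are assumptions.
    Dataset.File.upload_directory(
        src_dir=os.path.join(project_root, 'data'),
        target=DataPath(data_store, 'concrete_data_baseline'))

    # create a new dataset from the uploaded concrete file
    concrete_dataset = Dataset.Tabular.from_delimited_files(
        DataPath(data_store, 'concrete_data_baseline/concrete.csv'))

    # and register it in the workspace
    concrete_dataset.register(
        w_space, DATASET_NAME,
        description='Concrete Strength baseline data (w. header)')

    print('Dataset uploaded')
else:
    print('Dataset already exists')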

We can verify that the new dataset was correctly created by checking the Datasets tab in AzureML Workspace, where we should find the dataset concrete_baseline. We can also see the Datastore where we uploaded the data, dsazmlprj14, or the associated datastore created with the workspace. We confirm that the data file, concrete.csv, was uploaded to the concrete_data_baseline folder.
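The same check can also be done from code. The sketch below assumes that the azureml-core SDK (v1) with pandas support is installed and that `Workspace.from_config()` can find the workspace configuration from the current working directory. The function name `load_registered_dataset` is hypothetical. The import sits inside the function so that the file can still be imported on a machine without the SDK.

```python
def load_registered_dataset(name="concrete_baseline"):
    """Return the registered tabular dataset `name` as a pandas DataFrame.

    Assumes azureml-core (SDK v1) is installed and that a config.json
    is discoverable from the current working directory.
    """
    # Imported lazily so this module loads even without the SDK installed.
    from azureml.core import Dataset, Workspace

    w_space = Workspace.from_config()
    dataset = Dataset.get_by_name(w_space, name)
    return dataset.to_pandas_dataframe()
```

Calling `load_registered_dataset().shape` against the workspace should then report the number of rows and columns of the registered data.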

We can also look at the dataset by selecting the Explore tab on this page, which directs us to the data dump of the dataset. We can also see that features and labels are all numerical, as they all have a 00 near their name. We can also see that the dataset has 1030 instances.

This screen can also give us a quick overview of the data through the Preview tab. Here we can see the distribution, and information like type, minimum and maximum values and the mean and standard deviation for each feature.
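The per-feature summary shown on this screen is straightforward to reproduce locally. A minimal sketch using only the standard library follows. The input values are made up for illustration, and the use of the sample standard deviation is an assumption, since it is not stated here which variant the AzureML screen reports.

```python
import statistics


def summarize(values):
    """Return min, max, mean and sample standard deviation of `values`."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),
    }


# Made-up ages in days, for illustration only.
summary = summarize([1, 7, 28, 28, 90])
print(summary)
```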

If we click on the datasource link in the Details tab, AzureML will direct us to the blob storage where the data is stored. Here we can see the concrete.csv file.

The figures below show the screens and the links used to navigate through the dataset.

Dataset Upload

Next Post

In the next post, we will look at creating a pipeline that will enable us to test several different models on the concrete-strength data and select the best performing model.
