This article was published as a part of the Data Science Blogathon.
Prologue
State-of-the-art machine learning models and AI systems involve complex processes such as hyperparameter tuning, model selection for better accuracy, and the metrics that govern this behavior.
There are many types of machine learning algorithms available for training models, but in most cases choosing the most effective algorithm for your data and prediction requirements is difficult and complex. In addition, the data usually needs a number of preprocessing steps, such as imputing missing features and removing NaN values, to clean it up and prepare it for the ML model.
Experimenting with combinations of algorithms and preprocessing transformations to find the best model for your requirements takes a lot of time.
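To make this concrete, here is a minimal local sketch of the kind of search AutoML automates: trying several algorithm and preprocessing combinations and keeping the one with the best score. The dataset, candidate models, and metric below are all illustrative, not the ones Azure AutoML uses internally.

```python
# A minimal sketch of what AutoML automates: try several
# algorithm + preprocessing combinations, keep the best scorer.
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, random_state=0)

candidates = {
    "scaled logistic regression": make_pipeline(StandardScaler(), LogisticRegression()),
    "minmax logistic regression": make_pipeline(MinMaxScaler(), LogisticRegression()),
    "random forest": make_pipeline(RandomForestClassifier(random_state=0)),
}

# Score every candidate with cross-validated AUC and pick the winner
scores = {
    name: cross_val_score(model, X, y, cv=3, scoring="roc_auc").mean()
    for name, model in candidates.items()
}
best_name = max(scores, key=scores.get)
print(f"Best candidate: {best_name} (AUC = {scores[best_name]:.3f})")
```

Even with three candidates this takes noticeable time; AutoML runs many more combinations in parallel on cloud compute.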
Solution architecture
Let’s implement this.
AutoML solution
What are we waiting for? Let's start.
I have spent a lot of time finding the best model for my projects and tuning hyperparameters to improve accuracy. Azure Machine Learning can automate the comparison of models trained using different algorithms and preprocessing options.
You can either interact with Studio online through the visual interface or create personalized customizations with the SDK, which is available in Python. The only difference between these two methods is that the SDK gives you more control over the settings of your automated machine learning experiment, while the visual interface is easier to use.
Before we go any further, let's take a look at what AutoML is. This AutoML exercise shows you how to use your Azure subscription and Azure Machine Learning Studio to automatically try multiple preprocessing techniques and model-training algorithms in parallel, explore large datasets, and see how the entire machine learning process is automated.
Here, explore the power of cloud computing to find the best ML model for your data. Automated ML helps you train models without detailed data science or programming knowledge. For those with data science and programming experience, it offers a way to save time and resources by efficiently automating the process of algorithm selection and hyperparameter tuning.
Let’s start by creating a Machine Learning resource in the Azure cloud. I named the resource blog space because this is for a blog, but feel free to name it whatever you like. I left the default values and didn’t change anything.
After creating the resource group, you will see a page similar to the one above. Click the Studio web URL to go to the Machine Learning Studio, or navigate to it directly and log in with your credentials.
This is what the studio looks like. As you can see, it offers many great features used by developers all over the world. In the left column, scroll down to Compute and click it. Now create a compute instance and a compute cluster. Accept the default values, but you can choose the VM according to your subscription. I chose Standard_DS11_v2 (2 cores, 14 GB RAM, 28 GB disk) here, but feel free to choose from the list.
Code snippet
Let’s start coding!
On your Compute instance, click the Jupyter option, which will open the Jupyter Notebook (make sure you click Jupyter, not Jupyter Lab). Then I created a new notebook and named the notebook Automated ML. Let’s walk through the code cells in the notebook one by one. Running code in this notebook requires the latest versions of the azureml-sdk and azureml-widgets packages, as well as the azureml-train-automl package.
!pip show azureml-train-automl
After installing the required SDK, you can connect to your workspace.
import azureml.core
from azureml.core import Workspace

ws = Workspace.from_config()
print("Ready to use Azure ML {} to work with {}".format(azureml.core.VERSION, ws.name))
You need to load your training data into your notebook. The code below looks complicated, but it looks up the Titanic dataset in the data store. If it doesn’t exist, upload the data and store it in the data store.
Collecting AutoML data
from azureml.core import Dataset

default_ds = ws.get_default_datastore()

if 'Titanic dataset' not in ws.datasets:
    default_ds.upload_files(files=['./Titanic.csv'],  # Upload the Titanic csv file
                            target_path='Titanic-data/',  # Put it in a folder path in the datastore
                            overwrite=True,  # Replace existing files of the same name
                            show_progress=True)

    # Create a tabular dataset from the path on the datastore
    tab_data_set = Dataset.Tabular.from_delimited_files(path=(default_ds, 'Titanic-data/*.csv'))

    # Register the tabular dataset
    try:
        tab_data_set = tab_data_set.register(workspace=ws,
                                             name='Titanic dataset',
                                             description='Titanic data',
                                             tags={'format': 'CSV'},
                                             create_new_version=True)
        print('Dataset registered.')
    except Exception as ex:
        print(ex)
else:
    print('Dataset already registered.')

# Split the dataset into training and validation subsets
titanic_ds = ws.datasets.get('Titanic dataset')
train_ds, test_ds = titanic_ds.random_split(percentage=0.7, seed=123)
print("Data ready!")
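If you are unsure what `random_split(percentage=0.7, seed=123)` produces, here is a local analogy using pandas on a stand-in DataFrame (the column names and sizes are made up for illustration): a reproducible, non-overlapping 70/30 split.

```python
# Local analogy to Dataset.random_split: a reproducible 70/30 split.
import pandas as pd

df = pd.DataFrame({"Survived": [0, 1] * 50, "Age": range(100)})  # stand-in data

train_df = df.sample(frac=0.7, random_state=123)  # ~70% for training
test_df = df.drop(train_df.index)                 # the remaining ~30%
print(len(train_df), len(test_df))                # 70 30
```

The fixed seed means re-running the cell gives the same split, which keeps the experiment reproducible.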
Remember the cluster we created earlier? Now let’s connect to it.
from azureml.core.compute import ComputeTarget

training_cluster = ComputeTarget(workspace=ws, name="blog-cluster")
One of the most important configuration settings is the metric that evaluates model performance. You can get the list of metrics computed by automated machine learning for a particular type of model task (classification or regression) as follows:
import azureml.train.automl.utilities as automl_utils

for metric in automl_utils.get_primary_metrics('classification'):
    print(metric)
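One of the metrics this lists is AUC_weighted: ROC AUC averaged over classes, weighted by class support. For binary labels it reduces to the ordinary ROC AUC. You can reproduce it locally with scikit-learn; the labels and scores below are made up for illustration.

```python
# What "AUC_weighted" measures, computed locally with scikit-learn.
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 1]            # illustrative true labels
y_score = [0.1, 0.4, 0.35, 0.8, 0.9]  # illustrative predicted probabilities

auc_weighted = roc_auc_score(y_true, y_score, average="weighted")
print(f"AUC_weighted = {auc_weighted:.3f}")
```

A value of 1.0 would mean every positive example is ranked above every negative one; 0.5 is no better than chance.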
AutoML settings
Once you have decided which metric to optimize (AUC_weighted in this example), you can configure your automated machine learning run. Since this is a simple dataset, we kept the number of iterations to 4.
from azureml.train.automl import AutoMLConfig

automl_config = AutoMLConfig(name="Automated ML Experiment",
                             task='classification',
                             compute_target=training_cluster,
                             training_data=train_ds,
                             validation_data=test_ds,
                             label_column_name="Survived",
                             iterations=4,
                             primary_metric="AUC_weighted",
                             max_concurrent_iterations=2,
                             featurization='auto')
print("Ready for Auto ML run.")
Now that all the configurations are set, we are ready to run the experiment. I set show_output=False, but if you set it to True you can watch the models run in real time.
from azureml.core.experiment import Experiment
from azureml.widgets import RunDetails

print('Submitting Auto ML experiment...')
automl_experiment = Experiment(ws, 'Titanic-automl-sdk')
automl_run = automl_experiment.submit(automl_config)
RunDetails(automl_run).show()
automl_run.wait_for_completion(show_output=False)
Output
You can retrieve the best-performing run as shown below, along with its transformations and metrics. This code is not in the notebook, but you can try it; I will share it here.
# Retrieve the best run and its fitted model from the completed AutoML run
best_run, fitted_model = automl_run.get_output()

print('\nBest Run Transformations:')
for step in fitted_model.named_steps:
    print(step)

print('\nBest Run Metrics:')
best_run_metrics = best_run.get_metrics()
for metric_name in best_run_metrics:
    print(metric_name, best_run_metrics[metric_name])
Finally, when you find the best performing model, you can register it.
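The Azure registration call below is a sketch of the SDK pattern; it needs the live run objects from the notebook (and the model name, path, and tags are illustrative), so it is shown commented out. The joblib lines are a runnable illustration of keeping a local copy of a fitted model, using a placeholder model in place of the real one.

```python
# Sketch: register the winning model in the workspace (needs the live
# notebook objects, so it is commented out here):
# best_run, fitted_model = automl_run.get_output()
# best_run.register_model(model_name='Titanic-automl',          # illustrative name
#                         model_path='outputs/model.pkl',       # illustrative path
#                         tags={'Training context': 'Auto ML'})  # illustrative tags

# Runnable illustration: persist a fitted model locally with joblib.
import joblib
from sklearn.dummy import DummyClassifier

fitted_model = DummyClassifier(strategy="most_frequent").fit([[0], [1]], [1, 1])  # placeholder model
joblib.dump(fitted_model, "best_model.pkl")   # save to disk
reloaded = joblib.load("best_model.pkl")      # load it back
print(reloaded.predict([[0]]))                # [1]
```

Registering in the workspace versions the model and makes it available for deployment as a web service later.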
Difficulties faced
It wasn't a cakewalk at all.
When I first tried AutoML on my local system, I had some issues connecting my Azure subscription to Visual Studio Code. I couldn’t find a solution, so I migrated to Azure Cloud and created a Python notebook. It turned out to be perfect because I no longer had to create JSON files that required endpoints and subscription keys.
The output of some code cells was a little confusing, and it may look like gibberish to someone with little or no knowledge of this domain. For example, when you submit an AutoML experiment to run, you get an extensive list of values, including the models used, their dependencies, and their versions. I had to spend some time making sense of the output.
Conclusion
How does this help you?
Azure AutoML empowers data scientists and developers to build, deploy, and manage high-quality models faster and with greater confidence. With operations at such scale and complexity, industries need solutions that can get to production reliably and as quickly as possible.
- Open source interoperability
- Rapid model training and deployment
- Integrated tools
All these features allow us to meet industry standards. The Studio feature helps increase productivity: a development experience that supports the entire machine learning process, from model building to training to deployment.
Different models require different input data formats. Automated machine learning removes this burden by developing accurate models for image, text, or tabular data, tuning hyperparameters and applying feature engineering for you. Hate Jupyter notebooks? Don't worry. Use Visual Studio Code to move seamlessly from local to cloud training and scale up or down on powerful cloud-based CPU and GPU clusters.
In summary, here are some reasons why AutoML is preferred over traditional workflows.
Evaluate machine learning models using reproducible and automated workflows that cover:
- model fairness,
- explainability,
- error analysis,
- causal analysis,
- model performance,
- exploratory data analysis,
- Contextualize responsible AI metrics for technical and non-technical audiences to engage stakeholders and streamline compliance reviews.
Trust me, don't be discouraged. You have to push through it.
That’s all from me. Happy coding!
-Manav Mandal (Microsoft Learn Student Ambassador)
LinkedIn
Instagram
Media shown in this article are not owned by Analytics Vidhya and are used at the author’s discretion.