
The Ultimate Guide to Evaluation and Selection of Models in Machine Learning

To properly evaluate your machine learning models and select the best one, you need a good validation strategy and solid evaluation metrics picked for your problem. 

A good validation (evaluation) strategy is basically how you split your data to estimate future test performance. It could be as simple as a train-test split or a complex stratified k-fold strategy. 

Once you know how to estimate future model performance, you need to choose a metric that fits your problem. If you understand the classification and regression metrics, then most other, more complex metrics (in object detection, for example) are relatively easy to grasp.

When you nail those two, you are good.

In this article, I will talk about:

  • Choosing a good evaluation method (resampling, cross-validation, etc.)
  • Popular (and less known) classification and regression metrics
  • Bias/variance trade-offs in machine learning

So let’s get to it. 


Just to make sure we are on the same page, let’s get the definitions out of the way.

What is model evaluation?

Model evaluation  is a process of assessing the model’s performance on a chosen evaluation setup. It is done by calculating quantitative performance metrics like F1 score or RMSE or assessing the results qualitatively by the subject matter experts. The machine learning evaluation metrics you choose should reflect the business metrics you want to optimize with the machine learning solution.

What is model selection?

Model selection is the process of choosing the best ML model for a given task. It is done by comparing various model candidates on chosen evaluation metrics calculated on a designed evaluation schema. Choosing the correct evaluation schema, whether a simple train-test split or a complex cross-validation strategy, is the crucial first step of building any machine learning solution.

How to evaluate machine learning models and select the best one?

We’ll dive into this deeper, but let me give you a quick step-by-step:

Step 1: Choose a proper validation strategy. I can't stress this enough: without a reliable way to validate your model performance, no amount of hyperparameter tuning or state-of-the-art models will help you.

Step 2: Choose the right evaluation metric. Figure out the business case behind your model and try to use a machine learning metric that correlates with it. Typically, no single metric is ideal for the problem.

So calculate multiple metrics and make your decisions based on that. Sometimes you need to combine classic ML metrics with a subject matter expert evaluation. And that is ok.

Step 3: Keep track of your experiment results . Whether you use a spreadsheet or a dedicated  experiment tracker , make sure to log all the important metrics, learning curves, dataset versions, and configurations. You will thank yourself later.

Step 4: Compare experiments and pick a winner. Regardless of the metrics and validation strategy you choose, at the end of the day you want to find the best model. But no model is ever truly the best; some are just good enough.

So make sure to understand what is good enough for your problem, and once you hit that, move on to other parts of the project, like model deployment or pipeline orchestration.

Model selection in machine learning (choosing model validation strategy)

Resampling methods

Resampling methods, as the name suggests, are simple techniques of rearranging data samples to inspect if the model performs well on data samples that it has not been trained on. In other words, resampling helps us understand if the model will generalize well .

Random Split

Random splits are used to randomly sample a percentage of data into training, testing, and preferably validation sets. The advantage of this method is that there is a good chance that the original population is well represented in all three sets. In more formal terms, random splitting will prevent biased sampling of data.

It is very important to note the use of the validation set in model selection. The validation set acts as a second test set, and one might ask: why have two test sets?

In the process of feature selection and model tuning, the test set is used for model evaluation. This means that the model parameters and the feature set are selected such that they give an optimal result on the test set. Thus, the validation set, which contains completely unseen data points (not used in tuning and feature selection), is used for the final evaluation.
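To make this concrete, here is a minimal sketch of a three-way random split using scikit-learn (the dataset, the 60/20/20 ratio, and the variable names are illustrative assumptions, not part of the original article):

```python
# Random train/validation/test split (illustrative 60/20/20 ratio).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1_000, random_state=42)  # placeholder dataset

# First carve out the test set (20%), then split the rest into train/validation.
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.25, random_state=42)  # 0.25 of 80% = 20%
```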

Time-Based Split

There are some types of data where random splits are not possible. For example, if we have to train a model for weather forecasting, we cannot randomly divide the data into training and testing sets. This would jumble up the seasonal pattern! Such data is often referred to as time series data.

In such cases, a time-wise split is used. The training set can have data for the last three years and 10 months of the present year. The last two months can be reserved for the testing or validation set.

There is also the concept of window sets, where the model is trained up to a particular date and then tested on the following dates iteratively, such that the training window keeps expanding by one day at a time (and, consequently, the data left for testing shrinks by a day). The advantage of this method is that it stabilizes the model and prevents overfitting when the test set is very small (say, 3 to 7 days).
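As a rough illustration of such an expanding window, here is a sketch using scikit-learn's TimeSeriesSplit (the synthetic series and the fold sizes are assumptions for demonstration only):

```python
# Expanding-window (time-based) splits: training data always precedes test data in time.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

y = np.arange(100)  # stand-in for a chronologically ordered series
tscv = TimeSeriesSplit(n_splits=5, test_size=10)

for fold, (train_idx, test_idx) in enumerate(tscv.split(y)):
    print(f"fold {fold}: train up to index {train_idx[-1]}, test on {test_idx[0]}..{test_idx[-1]}")
```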

However, the drawback of time-series data is that the events or data points are not mutually independent . One event might affect every data input that follows after. 

For instance, a change in the governing party might considerably change the population statistics for the years to follow. Or the infamous coronavirus pandemic is going to have a massive impact on economic data for the next few years. 

No machine learning model can learn from past data in such a case because the data points before and after the event have major differences.

K-Fold Cross-Validation

The cross-validation technique works by randomly shuffling the dataset and then splitting it into k groups. Then, in each iteration, one group is held out as the test set while the remaining groups are combined into the training set. The model is trained and evaluated, and the process repeats until each of the k groups has served as the test set.

Thus, by the end of the process, one has k different results on k different test groups. These results are typically averaged, and the candidate model with the best average score can then be selected.
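A minimal sketch of k-fold cross-validation with scikit-learn might look like this (the model and dataset are placeholders; in practice you would compare several candidate models this way):

```python
# 5-fold cross-validation: train on 4 folds, evaluate on the held-out fold, repeat.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=cv, scoring="accuracy")
print(scores, scores.mean())  # per-fold scores and their average
```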


Stratified K-Fold

The process for stratified K-Fold is similar to that of K-Fold cross-validation with one single point of difference: unlike k-fold cross-validation, stratified k-fold takes the values of the target variable into consideration.

If, for instance, the target variable is a categorical variable with 2 classes, then stratified k-fold ensures that each fold preserves the same ratio of the two classes as the full dataset.

This makes the model evaluation more accurate and the model training less biased.
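A small sketch showing this effect, assuming an imbalanced binary target (the 90/10 split below is made up for illustration):

```python
# StratifiedKFold keeps the class ratio roughly constant in every fold.
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.zeros((100, 2))                     # dummy features
y = np.array([0] * 90 + [1] * 10)          # imbalanced 90/10 target

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (_, test_idx) in enumerate(skf.split(X, y)):
    print(f"fold {fold}: positive ratio in test fold = {y[test_idx].mean():.2f}")  # ~0.10 in each fold
```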

Bootstrap

Bootstrap is one of the most powerful ways to obtain a stabilized model. It is close to the random splitting technique since it follows the concept of random sampling.

The first step is to select a sample size (which is usually equal to the size of the original dataset). Thereafter, a sample data point is randomly selected from the original dataset and added to the bootstrap sample. After the addition, the sample is put back into the original dataset. This process is repeated N times, where N is the sample size.

Therefore, it is a resampling technique that creates the bootstrap sample by sampling data points from the original dataset with replacement . This means that the bootstrap sample can contain multiple instances of the same data point.

The model is trained on the bootstrap sample and then evaluated on all those data points that did not make it to the bootstrapped sample. These are called the out-of-bag samples.
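A minimal sketch of one bootstrap round with out-of-bag evaluation (the dataset and model are placeholders; a real setup would repeat this over many bootstrap samples):

```python
# One bootstrap sample (with replacement) plus out-of-bag (OOB) evaluation.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(42)

n = len(X)
boot_idx = rng.integers(0, n, size=n)          # sample N points with replacement
oob_mask = ~np.isin(np.arange(n), boot_idx)    # points never drawn = out-of-bag set

model = LogisticRegression(max_iter=5000).fit(X[boot_idx], y[boot_idx])
print("OOB accuracy:", accuracy_score(y[oob_mask], model.predict(X[oob_mask])))
```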


Probabilistic measures

Probabilistic Measures do not just take into account the model performance but also the model complexity . Model complexity is the measure of the model’s ability to capture the variance in the data. 

For example, a highly biased model like linear regression is low on complexity, whereas a neural network is very high on complexity.

Another important point to note here is that the model performance taken into account in probabilistic measures is calculated from the training set only . A hold-out test set is typically not required.

A notable disadvantage, however, is that probabilistic measures do not consider the uncertainty of the models and tend to favor simpler models over complex ones.

Akaike Information Criterion (AIC)

It is common knowledge that no model is completely accurate. There is always some information loss, which can be measured using KL information. Kullback-Leibler (KL) divergence is a measure of the difference between two probability distributions.

A statistician, Hirotugu Akaike, took into consideration the relationship between KL Information and Maximum Likelihood (in maximum-likelihood, one wishes to maximize the conditional probability of observing a datapoint X, given the parameters and a specified probability distribution) and developed the concept of Information Criterion (or IC). Therefore, Akaike’s IC or AIC is the measure of information loss. This is how the discrepancy between two different models is captured and the model with the least information loss is suggested as the model of choice.

AIC = 2K − 2 ln(L)

where:

  • K = number of independent variables or predictors
  • L = maximum-likelihood of the model 
  • N = number of data points in the training set (especially helpful in case of small datasets)

The limitation of AIC is that it favors complex models that lose less information on the training data, which does not always generalize well to new data.

Bayesian Information Criterion (BIC)

BIC was derived from the Bayesian probability concept and is suited for models that are trained under the maximum likelihood estimation.

BIC = K ln(N) − 2 ln(L)

where:

  • K = number of independent variables
  • L = maximum-likelihood
  • N = number of samples/data points in the training set

BIC penalizes the model for its complexity and is preferably used when the size of the dataset is not very small (otherwise it tends to settle on very simple models).
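Given a model's maximized log-likelihood, both criteria are straightforward to compute from the formulas above; here is a small sketch (the log-likelihood, K, and N values are hypothetical):

```python
# AIC and BIC computed directly from the definitions above.
import numpy as np

def aic(log_likelihood: float, k: int) -> float:
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood: float, k: int, n: int) -> float:
    return k * np.log(n) - 2 * log_likelihood

ll, k, n = -120.5, 4, 200   # hypothetical fit: log-likelihood, predictors, samples
print("AIC:", aic(ll, k), "BIC:", bic(ll, k, n))  # lower is better for both
```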

Minimum Description Length (MDL)

MDL is derived from information theory, which deals with quantities such as entropy that measure the average number of bits required to represent an event from a probability distribution or a random variable.

MDL, or the minimum description length, is the minimum number of such bits required to represent the model and the data given the model.

MDL = L(h) + L(D | h), minimized over candidate models h

where:

  • D = predictions made by the model
  • L(h) = number of bits required to represent the model
  • L(D | h) = number of bits required to represent the predictions from the model

Structural Risk Minimization (SRM)

Machine learning models face the inevitable problem of defining a generalized theory from a set of finite data. This leads to cases of overfitting where the model gets biased to the training data which is its primary learning source. SRM tries to balance out the model’s complexity against its success at fitting on the data.

How to evaluate ML models (choosing performance metrics)

Models can be evaluated using multiple metrics. However, the right choice of evaluation metric is crucial and often depends on the problem being solved. A clear understanding of a wide range of metrics helps the evaluator find an appropriate match between the problem statement and a metric.

Classification metrics

For every classification model prediction, a matrix called the confusion matrix can be constructed which demonstrates the number of test cases correctly and incorrectly classified. 

It looks something like this (considering 1 = positive and 0 = negative as the target classes):

  • TN: Number of negative cases correctly classified
  • TP: Number of positive cases correctly classified
  • FN: Number of positive cases incorrectly classified as negative
  • FP: Number of negative cases incorrectly classified as positive

Accuracy is the simplest metric and can be defined as the number of test cases correctly classified divided by the total number of test cases.

Accuracy = (TP + TN) / (TP + TN + FP + FN)

It can be applied to most generic problems but is not very useful when it comes to unbalanced datasets. 

For instance, if we are detecting frauds in bank data, the ratio of fraud to non-fraud cases can be 1:99. In such cases, if accuracy is used, the model will turn out to be 99% accurate by predicting all test cases as non-fraud. The 99% accurate model will be completely useless.

If a model is poorly trained such that it predicts all 1,000 (say) data points as non-frauds, it will miss the 10 fraud data points. If accuracy is measured, it will show that the model correctly predicts 990 data points, and thus it will have an accuracy of (990/1000)*100 = 99%!

This is why accuracy is a false indicator of the model’s health.

Therefore, for such a case, a metric is required that can focus on the ten fraud data points which were completely missed by the model.

Precision is the metric used to identify the correctness of classification.

Precision = TP / (TP + FP)

Intuitively, this equation is the ratio of correct positive classifications to the total number of predicted positive classifications. The greater the fraction, the higher the precision, and the better the model's ability to correctly classify the positive class.

In the problem of predictive maintenance (where one must predict in advance when a machine needs to be repaired), precision comes into play. The cost of maintenance is usually high and thus, incorrect predictions can lead to a loss for the company. In such cases, the ability of the model to correctly classify the positive class and to lower the number of false positives is paramount!

Recall tells us the number of positive cases correctly identified out of the total number of positive cases. 

Recall = TP / (TP + FN)

Going back to the fraud problem, the recall value will be very useful in fraud cases because a high recall value will indicate that a lot of fraud cases were identified out of the total number of frauds.

F1 score is the harmonic mean of Recall and Precision and therefore, balances out the strengths of each. 

It is useful in cases where both recall and precision can be valuable – like in the identification of plane parts that might require repairing. Here, precision will be required to save on the company’s cost (because plane parts are extremely expensive) and recall will be required to ensure that the machinery is stable and not a threat to human lives.

F1 = 2 × (Precision × Recall) / (Precision + Recall)

ROC curve is a plot of the true positive rate (recall) against the false positive rate (FP / (FP + TN)). AUC-ROC stands for Area Under the Receiver Operating Characteristic curve, and the higher the area, the better the model performance.

If the curve is somewhere near the 50% diagonal line, it suggests that the model randomly predicts the output variable.

[Figure: AUC-ROC curve]


Log loss is a very effective classification metric and is equivalent to -1* log (likelihood function) where the likelihood function suggests how likely the model thinks the observed set of outcomes was. 

Since the likelihood function produces very small values, they are easier to interpret after taking the logarithm. The negative sign reverses the order of the metric, so a lower log loss indicates a better model.
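Here is a minimal sketch computing the classification metrics discussed above with scikit-learn, on a tiny made-up set of labels and predicted probabilities (the numbers are purely illustrative):

```python
# Accuracy, precision, recall, F1, ROC AUC, and log loss on toy predictions.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, log_loss, confusion_matrix)

y_true  = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_pred  = [0, 0, 0, 0, 0, 1, 0, 1, 1, 0]
y_proba = [0.1, 0.2, 0.1, 0.3, 0.2, 0.6, 0.4, 0.8, 0.9, 0.3]  # predicted P(class = 1)

print(confusion_matrix(y_true, y_pred))             # [[TN, FP], [FN, TP]]
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1       :", f1_score(y_true, y_pred))
print("roc auc  :", roc_auc_score(y_true, y_proba))
print("log loss :", log_loss(y_true, y_proba))
```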

Gain and Lift Charts

Gain and lift charts are tools that evaluate model performance just like the confusion matrix but with a subtle, yet significant difference. The confusion matrix determines the performance of the model on the whole population or the entire test set, whereas the gain and lift charts evaluate the model on portions of the whole population. Therefore, we have a score (y-axis) for every % of the population (x-axis). 

Lift charts measure the improvement that a model brings in compared to random predictions. The improvement is referred to as the ‘lift’.

The K-S chart or Kolmogorov-Smirnov chart determines the degree of separation between two distributions – the positive class distribution and the negative class distribution. The higher the difference, the better is the model at separating the positive and negative cases.

Regression metrics

Regression models provide a continuous output variable, unlike classification models that have discrete output variables. Therefore, the metrics for assessing the regression models are accordingly designed.

Mean Squared Error or MSE

MSE is a simple metric that calculates the difference between the actual value and the predicted value (the error), squares it, and then averages the squared errors over all samples.

MSE = (1/N) Σ (yᵢ − ŷᵢ)², where yᵢ is the actual value and ŷᵢ is the predicted value

MSE is very sensitive to outliers and will show a very high error value even if a few outliers are present in the otherwise well-fitted model predictions.

Root Mean Squared Error or RMSE

RMSE is the root of MSE and is beneficial because it helps to bring down the scale of the errors closer to the actual values, making it more interpretable.


Mean Absolute Error or MAE

MAE is the mean of the absolute error values (actuals – predictions).

MAE = (1/N) Σ |yᵢ − ŷᵢ|

If one wants to ignore the outlier values to a certain degree, MAE is the choice since it reduces the penalty of the outliers significantly with the removal of the square terms.

Root Mean Squared Log Error or RMSLE

In RMSLE, the same equation as that of RMSE is followed except for an added log function along with the actual and predicted values.

RMSLE = √[ (1/N) Σ (log(yᵢ + 1) − log(xᵢ + 1))² ]

x is the actual value and y is the predicted value. This helps to scale down the effect of the outliers by downplaying the higher error rates with the log function. Also, RMSLE helps to capture a relative error (by comparing all the error values) through the use of logs.

R-Squared

R-Squared helps to identify the proportion of variance of the target variable that can be captured with the help of the independent variables or predictors.

R² = 1 − [ Σ(yᵢ − ŷᵢ)² / Σ(yᵢ − ȳ)² ]  (ȳ is the mean of the actual values)

R-square, however, has a gigantic problem. Say, a new unrelated feature is added to a model with an assigned weight of w. If the model finds absolutely no correlation between the new predictor and the target variable, w is 0. However, there is almost always a small correlation due to randomness which adds a small positive weight (w>0) and a new loss minimum is achieved due to overfitting.

This is why R-squared increases with any new feature addition. Thus, its inability to decrease when new features are added limits its usefulness for telling whether the model did better with fewer features.

Adjusted R-Squared

Adjusted R-Squared solves this problem: it penalizes the score as more features are added, so it can decrease when a new feature does not genuinely improve the model.

Adjusted R² = 1 − [ (1 − R²)(N − 1) / (N − k − 1) ], where N is the number of samples and k is the number of features

The (N − k − 1) term in the denominator shrinks as more features are added, which increases the penalty. Therefore, a significant increase in R² is required for the overall adjusted value to go up.
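A quick sketch of these regression metrics; RMSLE and adjusted R-squared are not single scikit-learn calls, so they are computed from the formulas above (the toy values and the assumed number of predictors are illustrative):

```python
# MSE, RMSE, MAE, RMSLE, R-squared, and adjusted R-squared on toy values.
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.5, 5.5, 7.0, 11.0])
n, k = len(y_true), 2  # k = assumed number of predictors

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_true, y_pred)
rmsle = np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))
r2 = r2_score(y_true, y_pred)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(mse, rmse, mae, rmsle, r2, adj_r2)
```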

Clustering metrics

Clustering algorithms predict groups of datapoints and hence, distance-based metrics are most effective.

Dunn Index

The Dunn Index focuses on identifying clusters that have low variance (among all members in the cluster) and are compact. The mean values of the different clusters also need to be far apart.

Dunn Index = min(i ≠ j) δ(Xi, Xj) / max(k) ∆(Xk)

where:

  • δ(Xi, Xj) is the intercluster distance, i.e., the distance between clusters Xi and Xj
  • ∆(Xk) is the intracluster distance of cluster Xk, i.e., the distance within cluster Xk

However, the disadvantage of Dunn index is that with a higher number of clusters and more dimensions, the computation cost increases.

Silhouette Coefficient

The Silhouette Coefficient measures, on a scale from -1 to +1, how close each point is to its own cluster compared to points in the other clusters:

  • Higher silhouette values (closer to +1) indicate that the sample points from two different clusters are far apart,
  • values around 0 indicate that the points are close to the decision boundary,
  • and values closer to -1 suggest that the points have been assigned to the wrong cluster.

Elbow method

The elbow method is used to determine the number of clusters in a dataset by plotting the number of clusters on the x-axis against the percentage of variance explained on the y-axis. The point on the x-axis where the curve suddenly bends (the elbow) suggests the optimal number of clusters.
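A short sketch combining the elbow method and the silhouette coefficient on synthetic blobs (the data and the range of k values are assumptions for illustration):

```python
# Inertia (for the elbow plot) and silhouette score for k = 2..8 clusters.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    print(f"k={k}  inertia={km.inertia_:.1f}  silhouette={silhouette_score(X, km.labels_):.3f}")
# Plot inertia against k and look for the "elbow"; the silhouette score should peak near the true k.
```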

[Figure: elbow method – number of clusters vs. variance explained]


Trade-offs in ML model selection

Bias vs. variance

On a high level, Machine Learning is the union of statistics and computation. The crux of machine learning revolves around the concept of algorithms or models, which are in fact, statistical estimations on steroids.

However, any given model has several limitations depending on the data distribution. None of them can be entirely accurate since they are just  estimations (even if on steroids) . These limitations are popularly known by the name of  bias  and  variance . 

A  model with high bias  will oversimplify by not paying much attention to the training points (e.g.: in Linear Regression, irrespective of data distribution, the model will always assume a linear relationship). 

Bias occurs when a model is strictly ruled by assumptions – like the linear regression model assumes that the relationship of the output variable with the independent variables is a straight line. This leads to underfitting when the actual values are non-linearly related to the independent variables.

A  model with high variance  will restrict itself to the training data by not generalizing for test points that it hasn’t seen before (e.g.:  Random Forest  with max_depth = None).

Variance is high when a model focuses on the training set too much and learns the variations very closely, compromising on generalization. This leads to overfitting .

The issue arises when the limitations are subtle, like when we have to choose between a random forest algorithm and a gradient boosting algorithm or between two variations of the same decision tree algorithm. Both will tend to have high variance and low bias.

An optimal model is one that has low bias and low variance. Since these two attributes are inversely related, the only way to achieve this is through a trade-off between them. Therefore, the model should be selected around the point where the bias and variance curves intersect, as in the image below.

[Figure: bias-variance trade-off – model complexity vs. error]

This can be achieved by iteratively tuning the hyperparameters of the model in use (hyperparameters are the configuration settings chosen before training, as opposed to the parameters the model learns from data). After every iteration, the model evaluation must take place with the use of a suitable metric.

Learning curves

The best way to track the progress of model training or build-up is to use learning curves. These curves help to identify the optimal points in a set of hyperparameter combinations and assists massively in the model selection and model evaluation process.

Typically, a learning curve is a way to track the learning or improvement in the ML model performance on the y-axis and the time or experience on the x-axis.

The two most popular learning curves are:

  • Training Learning Curve – It plots the evaluation metric score over time during the training process and thus helps to track the learning or progress of the model during training.
  • Validation Learning Curve – In this curve, the evaluation metric score is plotted against time on the validation set. 

Sometimes it might so happen that the training curve shows an improvement but the validation curve shows stunted performance. 

This is indicative of the fact that the model is overfitting and needs to be reverted to the previous iterations. In other words, the validation learning curve identifies how well the model is generalizing.

Therefore, there is a trade-off between the training learning curve and the validation learning curve, and the model should be selected at the point where validation performance is at its best, before it starts to diverge from the training curve.
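As a simple illustration, learning curves can be plotted from the per-epoch metric values logged during training; the sketch below uses hypothetical loss values and matplotlib (not part of the original article):

```python
# Training vs. validation loss; the validation curve rising again signals overfitting.
import matplotlib.pyplot as plt

train_loss = [0.90, 0.55, 0.40, 0.30, 0.24, 0.20, 0.17, 0.15]   # hypothetical
val_loss   = [0.92, 0.60, 0.48, 0.42, 0.41, 0.43, 0.46, 0.50]   # starts rising -> overfitting

epochs = range(1, len(train_loss) + 1)
plt.plot(epochs, train_loss, label="train loss")
plt.plot(epochs, val_loss, label="validation loss")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.show()
```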


What is next

Evaluating ML models and selecting the best-performing one is one of the main activities you do in pre-production. 

Hopefully, with this article, you’ve learned how to properly set up a model validation strategy and then how to choose a metric for your problem. 

You are ready to run a bunch of experiments and see what works. 

With that comes another problem of keeping track of experiment parameters, datasets used, configs, and results. 

And figuring out how to visualize and compare all of those models and results. 

For that, you may want to check out:

  • “ML Experiment Tracking: What It Is, Why It Matters, and How to Implement It” 
  • “15 Best Tools for ML Experiment Tracking and Management”
  • “Visualizing Machine Learning Models: Guide and Tools”
  • “How to Compare Machine Learning Models and Algorithms”

Other resources

Cross-validation and evaluation strategies from Kaggle competitions:

  • Image Classification: Tips and Tricks From 13 Kaggle Competitions (+ Tons of References)
  • Binary Classification: Tips and Tricks From 10 Kaggle Competitions
  • Tabular Data Binary Classification: All Tips and Tricks from 5 Kaggle Competitions
  • Text Classification: All Tips and Tricks from 5 Kaggle Competitions
  • Image Segmentation: Tips and Tricks from 39 Kaggle Competitions

Evaluation metrics and visualization:

  • Recommender Systems: Machine Learning Metrics and Business Metrics
  • How to Track Machine Learning Model Metrics in Your Projects
  • The Best Tools to Visualize Metrics and Hyperparameters of Machine Learning Experiments
  • How to Do Data Exploration for Image Segmentation and Object Detection (Things I Had to Learn the Hard Way)

Experiment tracking videos and real-world case studies:

  • Selecting the best computer vision models at Brainly
  • How to Use CI/CD to Automate the RL Evaluation Pipeline
  • Comparing CI/CD pipeline runs at Continuum Industries
  • How to Compare Images Between Runs
  • Scaling ML research at AILS Labs
  • Visualizing hyperparameter optimization studies at Theta Tech AI
  • How to Monitor Model Training Runs Live


Experiment Tracking in Machine Learning – Everything You Need to Know

  • Gaudenz Boesch
  • February 1, 2024


Experiment tracking, or experiment logging, is a key aspect of MLOps. Tracking experiments is important for iterative model development, which is the part of the ML project lifecycle where you try many things to get your model performance to the level you need.

In this article, we will answer the following questions:

  • What is experiment tracking in ML?
  • How does it work? (tutorials and examples)
  • How to implement experiment logging?
  • What are best practices?


What is Experiment Tracking for Machine Learning?

Experiment tracking is recording relevant metadata while developing a machine learning model. It provides researchers with a method to keep track of important changes during each iteration.

In this context, “experiment” refers to a specific iteration or version of the model. You can think of it in terms of any other scientific experiment:

  • You start with a hypothesis , i.e., how new changes will impact the outcomes.
  • Then, you adjust the inputs (code, datasets, or hyperparameters) accordingly and run the experiment.
  • Finally, you record the outputs . These outputs can be the results of performance benchmarks or an entirely new ML model.

Machine learning experiment tracking makes it possible to trace the exact cause and effect of changes to the model.

These parameters include, amongst others:

  • Hyperparameters: Learning rate, batch size, number of layers, neurons per layer, activation functions , dropout rates.
  • Model Performance Metrics: Accuracy, precision, recall , F1 score, area under the ROC curve (AUC-ROC).
  • Training Parameters: Number of epochs, loss function, optimizer type.
  • Hardware Usage: CPU/GPU utilization, memory usage.
  • Dataset Metrics: Size of training/validation/test sets, data augmentation techniques used.
  • Training Environment: Configuration files of the underlying system or software ecosystem.
  • Versioning Information: Model version, dataset version, code version.
  • Run Metadata: Timestamp of the run, duration of training, experiment ID.
  • Model-specific Data: Model weights or other tuning parameters.

Experiment tracking involves tracking all information to create reproducible results across every stage of the ML model development process.

Why is ML Experiment Tracking Important?

We derive machine learning models through an iterative process of trial and error. Researchers can adjust any number of parameters in various combinations to produce different results. A model can also go through an immense number of adaptations or versions before it reaches its final form.

Without knowing the what, why, and how, it’s impossible to draw informed conclusions about the model’s progress. Unfortunately, due to the complexity of these models, the causality between inputs and outputs is often non-obvious.

[Figure: the different types of complex data that must be tracked in an ML experiment]

A small change in any of the parameters above can significantly change the output. An ad hoc approach to tracking these changes and their effects on the model simply won’t cut it. This is particularly important for related tasks, such as hyperparameter optimization.

Data scientists need a formal process to track these changes over the lifetime of the development process. Experiment tracking makes it possible to compare and reproduce the results across iterations.

This allows them to understand past results and the cause-and-effect of adjusting various parameters. More importantly, it will help to more efficiently steer the training process in the right direction.

How does an ML Experiment Tracking System work?

Before we look at different methods to implement experiment tracking, let’s see what a solution should look like.

At the very least, an experiment-tracking solution should provide you with:

  • A centralized hub to store, organize, access, and manage experiment records.
  • Easy integration with your ML model training framework(s).
  • An efficient and accurate way to capture and record essential data .
  • An intuitive and accessible way to pull up records and compare them .
  • A way to leverage visualizations to represent data in ways that make sense to non-technical stakeholders.

For more advanced ML models or larger machine learning projects, you may also need the following:

  • The ability to track and report hardware resource consumption (monitoring of CPU, GPU utilization, and memory usage).
  • Integration with version control systems to track code, dataset, and ML model changes.
  • Collaboration features to facilitate team productivity and communication.
  • Custom reporting tools and dashboards .
  • The ability to scale with the growing number of experiments and provide robust security .

Experiment tracking should only be as complex as you need it to be. That’s why techniques vary from manually using paper or spreadsheets to fully automated commercial-off-the-shelf tools.

The centralized and collaborative aspect of experiment tracking is particularly important. You may conduct experiments on ML models in a variety of contexts. For example, on your office laptop at your desk. Or to run an ad hoc hyperparameter tuning job using a dedicated instance in the cloud.

Now, extrapolate this challenge across multiple individuals or teams.

If you don’t properly record or sync experiments, you may need to repeat work. Or, worst case, lose the details of a well-performing experiment.

[Figure: experiment tracking and logging to better understand machine learning model performance]

Best Practices in ML Experiment Tracking

So, we know that experiment tracking is vital to accurately reproduce experiments. It allows debugging and understanding ML models at a granular level. We also know the components that an effective experiment tracking solution should consist of.

However, there are also some best practices you should stick to.

  • Establish a standardized tracking protocol: You need to have a consistent practice of experiment tracking across ML projects . This includes standardizing code documentation, version documentation, data sets, parameters (input data), and results (output data).
  • Have rigorous version control: Implement both code and data version control. This helps track changes over time and to understand the impact of each modification. You can use tools like Git for code and DVC for data.
  • Automate data logging: You should automate experiment logging as much as possible. This includes capturing hyperparameters, model architectures , training procedures, and outcomes. Automation reduces human error and enhances consistency.
  • Implement meticulous documentation: Alongside automated logging, explain the rationale behind each experiment, the hypotheses tested, and interpretations of results. Contextual information is invaluable for future reference when working on dynamic ML models.
  • Opt for scalable and accessible tracking tools: This will help avoid delays due to operational constraints or the need for training.
  • Prioritize reproducibility: Check that you can reproduce the results of individual experiments. You need detailed information about the environment, dependencies, and random seeds to do this accurately.
  • Regular reviews and audits: Reviewing experiment processes and logs can help identify gaps in the tracking process. This allows you to refine your tracking system and make better decisions on future experiments.
  • Incorporate feedback loops: Similarly, this will help you incorporate learnings from past experiments into new ones. It will also help with team buy-in and address shortcomings in your methodologies.
  • Balance detail and overhead: Over-tracking can lead to unnecessary complexity, whereas insufficient tracking can miss critical insights. It’s important to find a balance depending on the complexity of your ML models and needs.


Difference Between Experiment Tracking and MLOps

If you work in an ML team, you’re probably already familiar with MLOps (machine learning operations) . MLOps is the process of holistically managing the end-to-end machine learning development lifecycle. It spans everything from:

  • Developing and training models,
  • Scheduling jobs,
  • Model testing,
  • Deploying models,
  • Model maintenance,
  • Managing model serving, to
  • Monitoring and retraining models in production

Experiment tracking is a specialized sub-discipline within the MLOps field. Its primary focus is the iterative development phase, which involves primarily training and testing models. Not to mention experimenting with various models, parameters, and data sets to optimize performance.

More specifically, it’s the process of tracking and utilizing the relevant metadata of each experiment.

Experiment tracking is especially critical for MLOps in research-focused projects. In these projects, models may never even reach production. Instead, experiment tracking offers valuable insights into model performance and the efficacy of different approaches.

This may help inform or direct future ML projects without having an immediate application or end goal. MLOps is of more critical concern in projects that will enter production and deployment.

[Figure: MLOps overview of tasks and the common actors involved]

How to Implement Experiment Tracking

Machine learning projects come in different shapes and sizes. Accordingly, there are a variety of ways you can track your experiments.

You should carefully select the best approach depending on:

  • The size of your team
  • The number of experiments you plan to run
  • The complexity of your ML models
  • The level of detail you require regarding experiment metadata
  • The key goals of your project/research. I.e., improving capabilities in a specific task or optimizing performance

Some of the common methods used today include:

  • Manual tracking using spreadsheets and naming conventions
  • Using software versioning tools/repositories
  • Automated tracking using dedicated ML experiment tracking tools

Let’s do a quick overview of each.

Manual Tracking

This involves using spreadsheets to manually log experiment details. Typically, you’ll use systematic naming conventions to organize files and experiments in directories. For example, by naming them after elements like model version, learning rate, batch size, and main results.

For example, model_v3_lr0.01_bs64_acc0.82.h5 might indicate version 3 of a model with a learning rate of 0.01 , a batch size of 64 , and an accuracy of 82% .
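A minimal sketch of this kind of manual logging, assuming hypothetical parameter values and file names:

```python
# Build a descriptive file name from run parameters and append the run to a CSV log.
import csv
from pathlib import Path

run = {"version": 3, "lr": 0.01, "batch_size": 64, "accuracy": 0.82}  # illustrative values
model_path = f"model_v{run['version']}_lr{run['lr']}_bs{run['batch_size']}_acc{run['accuracy']}.h5"

log_file = Path("experiments.csv")
write_header = not log_file.exists()
with log_file.open("a", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["model_path", *run.keys()])
    if write_header:
        writer.writeheader()
    writer.writerow({"model_path": model_path, **run})
```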

[Figure: example of a manual experiment tracking spreadsheet with related metadata]

This method is easy to implement, but it falls apart at scale. Not only is there a risk of logging incorrect information, but also of overwriting others' work. Plus, manually ensuring that version tracking conventions are being followed can be time-consuming and difficult.

Still, it may be suitable for small-scale or personal research projects using tools like Excel or Google Sheets.

Automated Versioning in a Git Repository

You can use a version control system, like Git, to track changes in machine learning experiments. Each experiment’s metadata (like hyperparameters, model configurations, and results) is stored as files in a Git repository. These files can include text documents, code, configuration files, and even serialized versions of models .

After each experiment, you commit the changes to the repository, creating a trackable history of the experiment iterations.
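For example, a run script might write its metadata to a file and commit it; the sketch below assumes the working directory is already a Git repository, and the file name and commit message format are made up for illustration:

```python
# Save experiment metadata and commit it so the run becomes part of the repo history.
import json
import subprocess

metadata = {
    "hyperparameters": {"lr": 0.01, "batch_size": 64},
    "metrics": {"val_accuracy": 0.82},
    "dataset_version": "v2",
}
with open("experiment.json", "w") as f:
    json.dump(metadata, f, indent=2)

subprocess.run(["git", "add", "experiment.json"], check=True)
subprocess.run(["git", "commit", "-m", "experiment: lr=0.01 bs=64 val_acc=0.82"], check=True)
```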

While it’s not fully automated, it does bring some of its benefits. For example, the system will automatically follow the naming conventions you implement. This reduces the risk of human error when you log metrics or other data.

It’s also much easier to revert to older versions without having to create, organize, and find copies manually. They have the built-in ability to branch and run parallel workflows.

These systems also have built-in collaboration, making it easy for team members to track changes and stay in sync. Plus, it’s relatively technology-agnostic, so you can use it across projects or frameworks.

However, not all these systems are optimized for large binary files. This is especially true for ML models where huge data sets containing model weights and other metadata are common. They also have limited features for visualizing and comparing experiments, not to mention live monitoring.

This approach is highly useful for projects that require a detailed history of changes. Also, many developers are already familiar with version control tools like Git, so adoption should be seamless for most teams. However, it still lacks some of the advanced capabilities on offer with dedicated MLOps or experiment tracking software.


Using Modern Experiment Tracking Tools

There are specialized software solutions designed to systematically record, organize, and compare data from machine learning experiments. Designed specifically for ML projects, they typically offer seamless integration with common models and frameworks.

On top of tools to track and store data, they also offer a user interface for viewing and analyzing results. This includes the ability to visualize data and create custom reports. Developers can also typically leverage APIs for logging data to various systems and compare different runs. Plus, they can monitor the experiment’s progress in real time.

Built for ML models, they excel at tracking hyperparameters, evaluation metrics, model weights, and outputs. Their functionalities are well-suited to typical ML tasks.

Common experiment tracking tools include:

  • TensorBoard
  • Weights & Biases

The Viso Suite platform also offers robust experiment tracking through its model evaluation tools. You can gain comprehensive insights into the performance of your computer vision experiments. Its range of functionalities includes regression, classification, detection analyses, semantic and instance segmentation, etc.

You can use this information to identify anomalies, mine hard samples, and detect incorrect prediction patterns. Interactive plots and label rendering on images facilitate data understanding, augmenting your MLOps decision-making.

What’s Next With Experiment Tracking?

Experiment tracking is a key component of ML model development, allowing for the recording and analysis of metadata for each iteration. Integration with comprehensive MLOps practices enhances model lifecycle management and operational efficiency, meaning that organizations can drive continuous improvement and innovation in their ML initiatives.

As experiment tracking tools and methodologies evolve, we can expect to see the model development process change and improve as well.



Experiment Tracking

Goku Mohandas


So far, we've been training and evaluating our different baselines but haven't really been tracking these experiments. We'll fix this by defining a proper process for experiment tracking, which we'll use for all future experiments (including hyperparameter optimization). Experiment tracking is the process of managing all the different experiments and their components, such as parameters, metrics, models, and other artifacts, and it enables us to:

  • Organize all the necessary components of a specific experiment. It's important to have everything in one place and know where it is so you can use them later.
  • Reproduce past results (easily) using saved experiments.
  • Log iterative improvements across time, data, ideas, teams, etc.

There are many options for experiment tracking, but we're going to use MLflow (100% free and open-source) because it has all the functionality we'll need. We can run MLflow on our own servers and databases, so there are no storage costs or limitations, which makes it one of the most popular options; it's used by Microsoft, Facebook, Databricks, and others. There are also several other popular options, such as Comet ML (used by Google AI, Hugging Face, etc.), Neptune (used by Roche, NewYorker, etc.), and Weights & Biases (used by OpenAI, Toyota Research, etc.). These are fully managed solutions that provide features like dashboards, reports, etc.

We'll start by setting up our model registry, where all of our experiments and their artifacts will be stored.

```python
import mlflow
from pathlib import Path
from ray.air.integrations.mlflow import MLflowLoggerCallback
import time
```

```python
MODEL_REGISTRY = Path("/tmp/mlflow")
Path(MODEL_REGISTRY).mkdir(parents=True, exist_ok=True)
MLFLOW_TRACKING_URI = "file://" + str(MODEL_REGISTRY.absolute())
mlflow.set_tracking_uri(MLFLOW_TRACKING_URI)
print(mlflow.get_tracking_uri())
```
On Windows, the tracking URI should have three forward slashes:

```python
MLFLOW_TRACKING_URI = "file:///" + str(MODEL_REGISTRY.absolute())
```

In this course, our MLflow artifact and backend store will both be on our local machine. In a production setting, these would be remote such as S3 for the artifact store and a database service (ex. PostgreSQL RDS) as our backend store.

Integration

While we could use MLflow directly to log metrics, artifacts and parameters:

```python
mlflow.log_metrics({"train_loss": train_loss, "val_loss": val_loss}, step=epoch)
mlflow.log_artifacts(dir)
mlflow.log_params(config)
```

We'll instead use Ray to integrate with MLflow. Specifically we'll use the MLflowLoggerCallback which will automatically log all the necessary components of our experiments to the location specified in our MLFLOW_TRACKING_URI . We of course can still use MLflow directly if we want to log something that's not automatically logged by the callback. And if we're using other experiment trackers, Ray has integrations for those as well.

```python
experiment_name = f"llm-{int(time.time())}"
mlflow_callback = MLflowLoggerCallback(
    tracking_uri=MLFLOW_TRACKING_URI,
    experiment_name=experiment_name,
    save_artifact=True,
)
```

Once we have the callback defined, all we have to do is update our RunConfig to include it.

```python
run_config = RunConfig(
    callbacks=[mlflow_callback],
    checkpoint_config=checkpoint_config,
)
```

With our updated RunConfig, which includes the MLflow callback, we can now train our model, and all the necessary components will be logged to MLflow. This is the exact same training workflow we've been using so far from the training lesson.

```python
ds = load_data()
train_ds, val_ds = stratify_split(ds, stratify="tag", test_size=test_size)

# Preprocess
preprocessor = CustomPreprocessor()
train_ds = preprocessor.fit_transform(train_ds)
val_ds = preprocessor.transform(val_ds)
train_ds = train_ds.materialize()
val_ds = val_ds.materialize()

# Trainer
trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    train_loop_config=train_loop_config,
    scaling_config=scaling_config,
    run_config=run_config,  # uses RunConfig with MLflow callback
    datasets={"train": train_ds, "val": val_ds},
    dataset_config=dataset_config,
    preprocessor=preprocessor,
)

# Train
results = trainer.fit()
```
Trial name: TorchTrainer_8c960_00000 | status: TERMINATED | loc: 10.0.18.44:68577 | iter: 10 | total time (s): 76.3089 | epoch: 9 | lr: 0.0001 | train_loss: 0.000549661
```python
results.metrics_dataframe
```
epoch lr train_loss val_loss timestamp time_this_iter_s should_checkpoint done training_iteration trial_id date time_total_s pid hostname node_ip time_since_restore iterations_since_restore
0 0 0.0001 0.005196 0.004071 1689030896 14.162520 True False 1 8c960_00000 2023-07-10_16-14-59 14.162520 68577 ip-10-0-18-44 10.0.18.44 14.162520 1
1 1 0.0001 0.004033 0.003898 1689030905 8.704429 True False 2 8c960_00000 2023-07-10_16-15-08 22.866948 68577 ip-10-0-18-44 10.0.18.44 22.866948 2
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
9 9 0.0001 0.000550 0.001182 1689030958 6.604867 True False 10 8c960_00000 2023-07-10_16-16-01 76.308887 68577 ip-10-0-18-44 10.0.18.44 76.308887 10

We're going to use the search_runs function from the MLflow Python API to identify the best run in our experiment so far (we've only done one run, so it will be the run from above).

```python
sorted_runs = mlflow.search_runs(experiment_names=[experiment_name], order_by=["metrics.val_loss ASC"])
sorted_runs
```

Once we're done training, we can use the MLflow dashboard to visualize our results. To do so, we'll use the mlflow server command to launch the MLflow dashboard and navigate to the experiment we just created.
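A typical invocation, assuming the local registry path set up earlier and port 8080 to match the URL below, might look like this (adjust the host, port, and store URI for your setup):

```bash
mlflow server -h 0.0.0.0 -p 8080 --backend-store-uri /tmp/mlflow
```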

View the dashboard

If you're on Anyscale Workspaces , then we need to first expose the port of the MLflow server. Run the following command on your Anyscale Workspace terminal to generate the public URL to your MLflow server.

If you're running this notebook on your local laptop then head on over to http://localhost:8080/ to view your MLflow dashboard.

MLflow creates a main dashboard with all of your experiments and their respective runs. We can sort runs by clicking on the column headers.

[Screenshot: MLflow runs dashboard]

And within each run, we can view metrics, parameters, artifacts, etc.

[Screenshot: MLflow run parameters and metrics]

And we can even create custom plots to help us visualize our results.

[Screenshot: custom MLflow plots]

After inspection and once we've identified an experiment that we like, we can load the model for evaluation and inference.

```python
from ray.air import Result
from urllib.parse import urlparse
```

We're going to create a small utility function that uses an MLflow run's artifact path to load a Ray Result object. We'll then use the Result object to load the best checkpoint.

```python
def get_best_checkpoint(run_id):
    artifact_dir = urlparse(mlflow.get_run(run_id).info.artifact_uri).path  # get path from mlflow
    results = Result.from_path(artifact_dir)
    return results.best_checkpoints[0][0]
```

With a particular run's best checkpoint, we can load the model from it and use it.

```python
best_checkpoint = get_best_checkpoint(run_id=best_run.run_id)
predictor = TorchPredictor.from_checkpoint(best_checkpoint)
performance = evaluate(ds=test_ds, predictor=predictor)
print(json.dumps(performance, indent=2))
```

Before we can use our model for inference, we need to load the preprocessor from our predictor and apply it to our input data.

```python
preprocessor = predictor.get_preprocessor()
```

```python
title = "Transfer learning with transformers"
description = "Using transformers for transfer learning on text classification tasks."
sample_df = pd.DataFrame([{"title": title, "description": description, "tag": "other"}])
predict_with_proba(df=sample_df, predictor=predictor)
```

In the next lesson we'll learn how to tune our models and use our MLflow dashboard to compare the results.


To cite this content, please use:

```bibtex
@article{madewithml,
    author       = {Goku Mohandas},
    title        = {Tracking - Made With ML},
    howpublished = {\url{https://madewithml.com/}},
    year         = {2023}
}
```


Evaluate automated machine learning experiment results


In this article, learn how to evaluate and compare models trained by your automated machine learning (automated ML) experiment. Over the course of an automated ML experiment, many jobs are created and each job creates a model. For each model, automated ML generates evaluation metrics and charts that help you measure the model's performance. You can further generate a Responsible AI dashboard to do a holistic assessment and debugging of the recommended best model by default. This includes insights such as model explanations, fairness and performance explorer, data explorer, model error analysis. Learn more about how you can generate a Responsible AI dashboard.

Automated ML generates different charts depending on the experiment type (classification vs. regression/forecasting); these are described in the sections that follow.

Items marked (preview) in this article are currently in public preview. The preview version is provided without a service level agreement, and it's not recommended for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews .

Prerequisites

  • An Azure subscription. (If you don't have an Azure subscription, create a free account before you begin)
  • The Azure Machine Learning studio (no code required)
  • The Azure Machine Learning Python SDK

View job results

After your automated ML experiment completes, a history of the jobs can be found via:

  • A browser with Azure Machine Learning studio
  • A Jupyter notebook using the JobDetails Jupyter widget

The following steps show you how to view the run history and the model evaluation metrics and charts in the studio:

  • Sign into the studio and navigate to your workspace.
  • In the left menu, select Jobs .
  • Select your experiment from the list of experiments.
  • In the table at the bottom of the page, select an automated ML job.
  • In the Models tab, select the Algorithm name for the model you want to evaluate.
  • In the Metrics tab, use the checkboxes on the left to view metrics and charts.

Classification metrics

Automated ML calculates performance metrics for each classification model generated for your experiment. These metrics are based on the scikit-learn implementations.

Many classification metrics are defined for binary classification on two classes, and require averaging over classes to produce one score for multi-class classification. Scikit-learn provides several averaging methods, three of which automated ML exposes: macro , micro , and weighted .

  • Macro - Calculate the metric for each class and take the unweighted average
  • Micro - Calculate the metric globally by counting the total true positives, false negatives, and false positives (independent of classes).
  • Weighted - Calculate the metric for each class and take the weighted average based on the number of samples per class.

While each averaging method has its benefits, one common consideration when selecting the appropriate method is class imbalance. If classes have different numbers of samples, it might be more informative to use a macro average where minority classes are given equal weighting to majority classes. Learn more about binary vs multiclass metrics in automated ML .
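To make the averaging methods concrete, here is a minimal sketch (assuming scikit-learn and a small made-up label set, not data from any real experiment) that computes the same F1 score under macro, micro, and weighted averaging:

from sklearn.metrics import f1_score

# Hypothetical imbalanced 3-class labels; class 2 is a minority class.
y_true = [0, 0, 0, 0, 1, 1, 1, 2]
y_pred = [0, 0, 0, 1, 1, 1, 0, 2]

# Macro: unweighted mean of the per-class F1 scores.
print(f1_score(y_true, y_pred, average="macro"))
# Micro: computed from global counts of true positives, false negatives, and false positives.
print(f1_score(y_true, y_pred, average="micro"))
# Weighted: per-class F1 scores weighted by the number of true instances in each class.
print(f1_score(y_true, y_pred, average="weighted"))

In general, the macro score gives the minority class the same influence as the majority classes, which is why it is often the more informative choice for imbalanced data.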

The following list summarizes the model performance metrics that automated ML calculates for each classification model generated for your experiment. For more detail, see the corresponding scikit-learn documentation for each metric.

Refer to the image metrics section for additional details on metrics for image classification models.

  • AUC: AUC is the area under the Receiver Operating Characteristic curve. Closer to 1 the better; range [0, 1]. Supported averaging variants: macro (the arithmetic mean of the AUC for each class), micro (computed by counting the total true positives, false negatives, and false positives), weighted (the arithmetic mean of the score for each class, weighted by the number of true instances in each class), and binary (the value of AUC obtained by treating one specific class as the true class and combining all other classes as the false class).
  • accuracy: Accuracy is the ratio of predictions that exactly match the true class labels. Closer to 1 the better; range [0, 1].
  • average_precision: Average precision summarizes a precision-recall curve as the weighted mean of precisions achieved at each threshold, with the increase in recall from the previous threshold used as the weight. Closer to 1 the better; range [0, 1]. Supported averaging variants: macro (the arithmetic mean of the average precision score of each class), micro (computed by counting the total true positives, false negatives, and false positives), weighted (the arithmetic mean of the average precision score for each class, weighted by the number of true instances in each class), and binary (the value of average precision obtained by treating one specific class as the true class and combining all other classes as the false class).
  • balanced_accuracy: Balanced accuracy is the arithmetic mean of recall for each class. Closer to 1 the better; range [0, 1].
  • f1_score: F1 score is the harmonic mean of precision and recall. It's a balanced measure of both false positives and false negatives, but it doesn't take true negatives into account. Closer to 1 the better; range [0, 1]. Supported averaging variants: macro (the arithmetic mean of the F1 score for each class), micro (computed by counting the total true positives, false negatives, and false positives), weighted (the mean of the F1 score for each class, weighted by class frequency), and binary (the value of F1 obtained by treating one specific class as the true class and combining all other classes as the false class).
  • log_loss: This is the loss function used in (multinomial) logistic regression and extensions of it such as neural networks, defined as the negative log-likelihood of the true labels given a probabilistic classifier's predictions. Closer to 0 the better; range [0, inf).
  • norm_macro_recall: Normalized macro recall is recall macro-averaged and normalized, so that random performance has a score of 0 and perfect performance has a score of 1. It is computed as (recall_score_macro - R) / (1 - R), where R is the expected value of recall_score_macro for random predictions: R = 0.5 for binary classification and R = 1/C for C-class classification problems. Closer to 1 the better; range [0, 1].
  • matthews_correlation: Matthews correlation coefficient is a balanced measure of accuracy that can be used even if one class has many more samples than another. A coefficient of 1 indicates perfect prediction, 0 random prediction, and -1 inverse prediction. Closer to 1 the better; range [-1, 1].
  • precision: Precision is the ability of a model to avoid labeling negative samples as positive. Closer to 1 the better; range [0, 1]. Supported averaging variants: macro (the arithmetic mean of precision for each class), micro (computed globally by counting the total true positives and false positives), weighted (the arithmetic mean of precision for each class, weighted by the number of true instances in each class), and binary (the value of precision obtained by treating one specific class as the true class and combining all other classes as the false class).
  • recall: Recall is the ability of a model to detect all positive samples. Closer to 1 the better; range [0, 1]. Supported averaging variants: macro (the arithmetic mean of recall for each class), micro (computed globally by counting the total true positives, false negatives, and false positives), weighted (the arithmetic mean of recall for each class, weighted by the number of true instances in each class), and binary (the value of recall obtained by treating one specific class as the true class and combining all other classes as the false class).
  • weighted_accuracy: Weighted accuracy is accuracy where each sample is weighted by the total number of samples belonging to the same class. Closer to 1 the better; range [0, 1].

Binary vs. multiclass classification metrics

Automated ML automatically detects if the data is binary and also allows users to activate binary classification metrics even if the data is multiclass by specifying a true class. Multiclass classification metrics are reported if a dataset has two or more classes. Binary classification metrics are reported only when the data is binary.

Note that multiclass classification metrics are intended for multiclass classification. When applied to a binary dataset, these metrics don't treat any class as the true class, as you might expect. Metrics that are clearly meant for multiclass are suffixed with micro, macro, or weighted. Examples include average_precision_score, f1_score, precision_score, recall_score, and AUC. For example, instead of calculating recall as tp / (tp + fn), the multiclass averaged recall (micro, macro, or weighted) averages over both classes of a binary classification dataset. This is equivalent to calculating the recall for the true class and the false class separately, and then taking the average of the two.
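As a small illustration of that difference (a sketch assuming scikit-learn and made-up binary labels), recall computed for the true class alone differs from the macro-averaged recall, which averages the recall of both classes:

from sklearn.metrics import recall_score

# Hypothetical binary labels; class 1 is the class of interest.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]

# Binary recall for the true class only: tp / (tp + fn) = 3 / 4 = 0.75
print(recall_score(y_true, y_pred, average="binary", pos_label=1))

# Macro recall averages recall over class 1 (0.75) and class 0 (0.5), giving 0.625.
print(recall_score(y_true, y_pred, average="macro"))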

In addition, although automatic detection of binary classification is supported, it's still recommended to always specify the true class manually to make sure the binary classification metrics are calculated for the correct class.

To activate binary classification metrics when the dataset itself is multiclass, you only need to specify the class to be treated as the true class; these metrics are then calculated.

Confusion matrix

Confusion matrices provide a visual for how a machine learning model is making systematic errors in its predictions for classification models. The word "confusion" in the name comes from a model "confusing" or mislabeling samples. A cell at row i and column j in a confusion matrix contains the number of samples in the evaluation dataset that belong to class C_i and are classified by the model as class C_j .

In the studio, a darker cell indicates a higher number of samples. Selecting Normalized view in the dropdown normalizes over each matrix row to show the percent of class C_i predicted to be class C_j . The benefit of the default Raw view is that you can see whether imbalance in the distribution of actual classes caused the model to misclassify samples from the minority class, a common issue in imbalanced datasets.

The confusion matrix of a good model has most samples along the diagonal.
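The raw and normalized views can be reproduced outside the studio as well. A minimal sketch, assuming scikit-learn and toy labels:

from sklearn.metrics import confusion_matrix

# Hypothetical 3-class labels with an imbalanced class distribution.
y_true = [0, 0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 0, 1, 1, 2, 2]

# Raw counts: cell (i, j) is the number of class-i samples the model labeled as class j.
print(confusion_matrix(y_true, y_pred))

# Row-normalized view: each row sums to 1, showing the fraction of class i predicted as class j.
print(confusion_matrix(y_true, y_pred, normalize="true"))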

Confusion matrix for a good model

Confusion matrix for a bad model

ROC curve

The receiver operating characteristic (ROC) curve plots the relationship between true positive rate (TPR) and false positive rate (FPR) as the decision threshold changes. The ROC curve can be less informative when training models on datasets with high class imbalance, as the majority class can drown out contributions from minority classes.

The area under the curve (AUC) can be interpreted as the proportion of correctly classified samples. More precisely, the AUC is the probability that the classifier ranks a randomly chosen positive sample higher than a randomly chosen negative sample. The shape of the curve gives an intuition for relationship between TPR and FPR as a function of the classification threshold or decision boundary.

A curve that approaches the top-left corner of the chart is approaching a 100% TPR and 0% FPR, the best possible model. A random model would produce an ROC curve along the y = x line from the bottom-left corner to the top-right. A worse than random model would have an ROC curve that dips below the y = x line.
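For readers who want to reproduce the curve outside the studio, here is a minimal sketch assuming scikit-learn and made-up scores (the positive-class probabilities would normally come from your model):

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical binary labels and predicted probabilities for the positive class.
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.55])

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # one (FPR, TPR) point per decision threshold
print(roc_auc_score(y_true, y_score))              # area under the ROC curve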

For classification experiments, each of the line charts produced for automated ML models can be used to evaluate the model per-class or averaged over all classes. You can switch between these different views by clicking on class labels in the legend to the right of the chart.

ROC curve for a good model

ROC curve for a bad model

Precision-recall curve

The precision-recall curve plots the relationship between precision and recall as the decision threshold changes. Recall is the ability of a model to detect all positive samples and precision is the ability of a model to avoid labeling negative samples as positive. Some business problems might require higher recall and some higher precision depending on the relative importance of avoiding false negatives vs false positives.
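A minimal sketch of the same idea with scikit-learn, using made-up labels and scores:

from sklearn.metrics import precision_recall_curve, average_precision_score

# Hypothetical binary labels and predicted probabilities for the positive class.
y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]

# One (precision, recall) point per decision threshold.
precision, recall, thresholds = precision_recall_curve(y_true, y_score)

# Average precision summarizes the curve as a single number.
print(average_precision_score(y_true, y_score))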

Precision-recall curve for a good model

Precision-recall curve for a bad model

Cumulative gains curve

The cumulative gains curve plots the percent of positive samples correctly classified as a function of the percent of samples considered, where samples are considered in order of decreasing predicted probability.

To calculate gain, first sort all samples from highest to lowest probability predicted by the model. Then take x% of the highest confidence predictions. Divide the number of positive samples detected in that x% by the total number of positive samples to get the gain. Cumulative gain is the percent of positive samples we detect when considering some percent of the data that is most likely to belong to the positive class.
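The recipe above translates directly into code. A minimal sketch with NumPy, using hypothetical labels and scores:

import numpy as np

def cumulative_gain(y_true, y_score, fraction):
    # Fraction of all positive samples captured in the top `fraction` of predictions,
    # sorted from highest to lowest predicted probability.
    y_true = np.asarray(y_true)
    order = np.argsort(y_score)[::-1]             # sort by predicted probability, descending
    top_k = int(np.ceil(fraction * len(y_true)))  # take the top x% most confident predictions
    return y_true[order][:top_k].sum() / y_true.sum()

y_true = [1, 0, 1, 0, 0, 1, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.75, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1]
print(cumulative_gain(y_true, y_score, 0.3))  # gain at 30% of the data: 2 of 4 positives = 0.5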

A perfect model ranks all positive samples above all negative samples giving a cumulative gains curve made up of two straight segments. The first is a line with slope 1 / x from (0, 0) to (x, 1) where x is the fraction of samples that belong to the positive class ( 1 / num_classes if classes are balanced). The second is a horizontal line from (x, 1) to (1, 1) . In the first segment, all positive samples are classified correctly and cumulative gain goes to 100% within the first x% of samples considered.

The baseline random model has a cumulative gains curve following y = x, where for x% of samples considered only about x% of the total positive samples are detected. A perfect model on a balanced dataset has a micro average curve and a macro average line with slope num_classes until the cumulative gain reaches 100%, and then horizontal until the data percent is 100%.

Cumulative gains curve for a good model

Cumulative gains curve for a bad model

Lift curve

The lift curve shows how many times better a model performs compared to a random model. Lift is defined as the ratio of cumulative gain to the cumulative gain of a random model (which should always be 1).

This relative performance takes into account the fact that classification gets harder as you increase the number of classes. (A random model incorrectly predicts a higher fraction of samples from a dataset with 10 classes compared to a dataset with two classes)

The baseline lift curve is the y = 1 line where the model performance is consistent with that of a random model. In general, the lift curve for a good model is higher on that chart and farther from the x-axis, showing that when the model is most confident in its predictions it performs many times better than random guessing.
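As a tiny sketch of that definition (hypothetical labels and scores, plain NumPy):

import numpy as np

# Lift at x% of the data = cumulative gain at x% divided by x (the gain of a random model).
y_true = np.array([1, 0, 1, 0, 0, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.8, 0.75, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1])
x = 0.3
top_k = int(np.ceil(x * len(y_true)))
gain = y_true[np.argsort(y_score)[::-1]][:top_k].sum() / y_true.sum()
print(gain / x)  # a lift above 1 means the model beats random guessing at this depth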

Lift curve for a good model

Lift curve for a bad model

Calibration curve

The calibration curve plots a model's confidence in its predictions against the proportion of positive samples at each confidence level. A well-calibrated model will correctly classify 100% of the predictions to which it assigns 100% confidence, 50% of the predictions it assigns 50% confidence, 20% of the predictions it assigns a 20% confidence, and so on. A perfectly calibrated model has a calibration curve following the y = x line where the model perfectly predicts the probability that samples belong to each class.

An over-confident model over-predicts probabilities close to zero and one, rarely being uncertain about the class of each sample, and its calibration curve will look similar to a backward "S". An under-confident model assigns a lower probability on average to the class it predicts, and the associated calibration curve looks similar to an "S". The calibration curve doesn't depict a model's ability to classify correctly, but instead its ability to correctly assign confidence to its predictions. A bad model can still have a good calibration curve if the model correctly assigns low confidence and high uncertainty.

The calibration curve is sensitive to the number of samples, so a small validation set can produce noisy results that can be hard to interpret. This does not necessarily mean that the model is not well-calibrated.
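A similar curve can be computed with scikit-learn's calibration_curve; a minimal sketch with made-up labels and probabilities (a real run would use a larger validation set, for the reason just noted):

from sklearn.calibration import calibration_curve

# Hypothetical binary labels and predicted probabilities.
y_true = [0, 0, 0, 1, 1, 1, 0, 1, 1, 1]
y_prob = [0.1, 0.2, 0.35, 0.4, 0.55, 0.6, 0.3, 0.8, 0.9, 0.7]

# Fraction of positives vs. mean predicted probability per bin; y = x means perfect calibration.
prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=5)
print(prob_true)
print(prob_pred)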

Calibration curve for a good model

Calibration curve for a bad model

Regression/forecasting metrics

Automated ML calculates the same performance metrics for each model generated, regardless of whether it's a regression or forecasting experiment. These metrics also undergo normalization to enable comparison between models trained on data with different ranges. To learn more, see metric normalization.

The following list summarizes the model performance metrics generated for regression and forecasting experiments. Like the classification metrics, these metrics are also based on the scikit-learn implementations; see the corresponding scikit-learn documentation for exact definitions.

  • explained_variance: Explained variance measures the extent to which a model accounts for the variation in the target variable. It's the percent decrease in the variance of the original data relative to the variance of the errors. When the mean of the errors is 0, it's equal to the coefficient of determination (see r2_score below). Closer to 1 the better; range (-inf, 1].
  • mean_absolute_error: Mean absolute error is the expected value of the absolute difference between the target and the prediction. Closer to 0 the better; range [0, inf). Normalized variant: normalized_mean_absolute_error, the mean_absolute_error divided by the range of the data.
  • mean_absolute_percentage_error: Mean absolute percentage error (MAPE) is a measure of the average difference between a predicted value and the actual value. Closer to 0 the better; range [0, inf).
  • median_absolute_error: Median absolute error is the median of all absolute differences between the target and the prediction. This loss is robust to outliers. Closer to 0 the better; range [0, inf). Normalized variant: normalized_median_absolute_error, the median_absolute_error divided by the range of the data.
  • r2_score: R² (the coefficient of determination) measures the proportional reduction in mean squared error (MSE) relative to the total variance of the observed data. Closer to 1 the better; range [-1, 1]. Note: R² often has the range (-inf, 1]. The MSE can be larger than the observed variance, so R² can have arbitrarily large negative values, depending on the data and the model predictions. Automated ML clips reported R² scores at -1, so a value of -1 likely means that the true R² score is less than -1. Consider the other metric values and the properties of the data when interpreting a negative R² score.
  • root_mean_squared_error: Root mean squared error (RMSE) is the square root of the expected squared difference between the target and the prediction. For an unbiased estimator, RMSE is equal to the standard deviation. Closer to 0 the better; range [0, inf). Normalized variant: normalized_root_mean_squared_error, the root_mean_squared_error divided by the range of the data.
  • root_mean_squared_log_error: Root mean squared log error is the square root of the expected squared logarithmic error. Closer to 0 the better; range [0, inf). Normalized variant: normalized_root_mean_squared_log_error, the root_mean_squared_log_error divided by the range of the data.
  • spearman_correlation: Spearman correlation is a nonparametric measure of the monotonicity of the relationship between two datasets. Unlike the Pearson correlation, the Spearman correlation doesn't assume that both datasets are normally distributed. Like other correlation coefficients, Spearman varies between -1 and 1, with 0 implying no correlation; correlations of -1 or 1 imply an exact monotonic relationship. Spearman is a rank-order correlation metric, meaning that changes to predicted or actual values won't change the Spearman result if they don't change the rank order of predicted or actual values. Closer to 1 the better; range [-1, 1].

Metric normalization

Automated ML normalizes regression and forecasting metrics, which enables comparison between models trained on data with different ranges. A model trained on data with a larger range has a higher error than the same model trained on data with a smaller range, unless that error is normalized.

While there's no standard method of normalizing error metrics, automated ML takes the common approach of dividing the error by the range of the data: normalized_error = error / (y_max - y_min)

The range of the data isn't saved with the model. If you do inference with the same model on a holdout test set, y_min and y_max may change according to the test data, and the normalized metrics can't be used directly to compare the model's performance on the training and test sets. You can pass in the values of y_min and y_max from your training set to make the comparison fair.
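A minimal sketch of this normalization (plain NumPy, hypothetical targets), with optional y_min/y_max arguments so the training-set range can be reused on a holdout set as suggested above:

import numpy as np

def normalized_rmse(y_true, y_pred, y_min=None, y_max=None):
    # RMSE divided by the range of the data: normalized_error = error / (y_max - y_min).
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    y_min = y_true.min() if y_min is None else y_min   # pass the training-set range here
    y_max = y_true.max() if y_max is None else y_max   # to compare training and test fairly
    return rmse / (y_max - y_min)

print(normalized_rmse([10.0, 20.0, 30.0, 40.0], [12.0, 18.0, 33.0, 41.0]))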

Forecasting metrics: normalization and aggregation

Calculating metrics for forecasting model evaluation requires some special considerations when the data contains multiple time series. There are two natural choices for aggregating metrics over multiple series:

  • A macro average wherein the evaluation metrics from each series are given equal weight,
  • A micro average wherein evaluation metrics for each prediction have equal weight.

These cases have direct analogies to macro and micro averaging in multi-class classification .

The distinction between macro and micro averaging can be important when selecting a primary metric for model selection. For example, consider a retail scenario where you want to forecast demand for a selection of consumer products. Some products sell at higher volumes than others. If you choose a micro-averaged RMSE as the primary metric, it's possible that the high-volume items contribute most of the modeling error and, so, dominate the metric. The model selection algorithm might favor models with higher accuracy on the high-volume items than on the low-volume ones. In contrast, a macro-averaged, normalized RMSE gives low-volume items approximately equal weight to the high-volume items.
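The weighting difference is easy to see in a small sketch (plain NumPy, hypothetical series; note that AutoML additionally normalizes each series before macro averaging, which this sketch omits):

import numpy as np

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

# Hypothetical forecasts for a high-volume and a low-volume product.
series = {
    "high_volume": ([1000.0, 1100.0, 1200.0], [950.0, 1180.0, 1120.0]),
    "low_volume":  ([10.0, 12.0, 9.0],        [11.0, 10.0, 9.5]),
}

# Macro average: compute the metric per series, then average (each series counts equally).
macro = np.mean([rmse(t, p) for t, p in series.values()])

# Micro average: pool all predictions, then compute one metric (each prediction counts equally).
all_true = np.concatenate([t for t, _ in series.values()])
all_pred = np.concatenate([p for _, p in series.values()])
micro = rmse(all_true, all_pred)

print(macro, micro)  # the high-volume series dominates the micro average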

In AutoML, the normalized forecasting metrics (such as normalized_root_mean_squared_error) are macro averaged, while their unnormalized counterparts (such as root_mean_squared_error) are micro averaged.

Note that macro-averaged metrics normalize each series separately. The normalized metrics from each series are then averaged to give the final result. The correct choice of macro vs. micro depends on the business scenario, but we generally recommend using normalized_root_mean_squared_error.

Residuals

The residuals chart is a histogram of the prediction errors (residuals) generated for regression and forecasting experiments. Residuals are calculated as y_predicted - y_true for all samples and then displayed as a histogram to show model bias.

In this example, both models are slightly biased to predict lower than the actual value. This isn't uncommon for a dataset with a skewed distribution of actual targets, but indicates worse model performance. A good model has a residuals distribution that peaks at zero with few residuals at the extremes. A worse model has a spread out residuals distribution with fewer samples around zero.

Residuals chart for a good model

Residuals chart for a bad model

Predicted vs. true

For regression and forecasting experiments, the predicted vs. true chart plots the relationship between the target feature (true/actual values) and the model's predictions. The true values are binned along the x-axis, and for each bin the mean predicted value is plotted with error bars. This lets you see whether a model is biased toward predicting certain values. The line displays the average prediction, and the shaded area indicates the variance of predictions around that mean.

Often, the most common true value has the most accurate predictions with the lowest variance. The distance of the trend line from the ideal y = x line where there are few true values is a good measure of model performance on outliers. You can use the histogram at the bottom of the chart to reason about the actual data distribution. Including more data samples where the distribution is sparse can improve model performance on unseen data.

In this example, note that the better model has a predicted vs. true line that is closer to the ideal y = x line.

Predicted vs. true chart for a good model

Predicted vs. true chart for a bad model

Forecast horizon

For forecasting experiments, the forecast horizon chart plots the relationship between the model's predicted values and the actual values mapped over time per cross-validation fold, up to five folds. The x-axis maps time based on the frequency you provided during training setup. The vertical line in the chart marks the forecast horizon point, also referred to as the horizon line, which is the time period at which you would want to start generating predictions. To the left of the forecast horizon line, you can view historic training data to better visualize past trends. To the right of the forecast horizon, you can visualize the predictions (the purple line) against the actuals (the blue line) for the different cross-validation folds and time series identifiers. The shaded purple area indicates the confidence intervals or variance of predictions around that mean.

You can choose which cross-validation fold and time series identifier combinations to display by clicking the edit pencil icon in the top right corner of the chart. Select from the first five cross-validation folds and up to 20 different time series identifiers to visualize the chart for your various time series.

This chart is available in the training run for models generated from training and validation data, as well as in the test run based on training data and test data. We allow up to 20 data points before and up to 80 data points after the forecast origin. For DNN models, this chart in the training run shows data from the last epoch, i.e., after the model has been trained completely. The chart in the test run can have a gap before the horizon line if validation data was explicitly provided during the training run. This is because the test run uses the training data and test data but leaves out the validation data, which results in a gap.

Forecast horizon chart

Metrics for image models (preview)

Automated ML uses the images from the validation dataset for evaluating the performance of the model. The performance of the model is measured at an epoch-level to understand how the training progresses. An epoch elapses when an entire dataset is passed forward and backward through the neural network exactly once.

Image classification metrics

The primary metric for evaluation is accuracy for binary and multi-class classification models and IoU (Intersection over Union) for multilabel classification models. The classification metrics for image classification models are the same as those defined in the classification metrics section. The loss values associated with an epoch are also logged, which can help you monitor how the training progresses and determine whether the model is over-fitting or under-fitting.

Every prediction from a classification model is associated with a confidence score, which indicates the level of confidence with which the prediction was made. Multilabel image classification models are by default evaluated with a score threshold of 0.5, which means only predictions with at least this level of confidence are considered positive predictions for the associated class. Multiclass classification doesn't use a score threshold; instead, the class with the maximum confidence score is considered the prediction.
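A small sketch of the difference between the two decision rules, using hypothetical per-class confidence scores:

import numpy as np

# Hypothetical confidence scores for three images and three classes.
scores = np.array([
    [0.7, 0.2, 0.6],
    [0.1, 0.8, 0.4],
    [0.3, 0.3, 0.9],
])

# Multiclass: the single class with the maximum confidence is the prediction.
print(scores.argmax(axis=1))        # [0 1 2]

# Multilabel: every class whose score clears the 0.5 threshold is predicted.
print((scores >= 0.5).astype(int))  # e.g. the first image is assigned classes 0 and 2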

Epoch-level metrics for image classification

Unlike the classification metrics for tabular datasets, image classification models log all the classification metrics at an epoch-level as shown below.

Epoch-level charts for image classification

Summary metrics for image classification

Apart from the scalar metrics that are logged at the epoch level, image classification models also log summary metrics like the confusion matrix, classification charts (including the ROC curve and precision-recall curve), and a classification report for the model from the best epoch, i.e. the epoch with the highest primary metric (accuracy) score.

The classification report provides the class-level values for metrics like precision, recall, f1-score, support, auc, and average_precision with various levels of averaging (micro, macro, and weighted), as shown below. Refer to the metric definitions in the classification metrics section.

Classification report for image classification

Object detection and instance segmentation metrics

Every prediction from an image object detection or instance segmentation model is associated with a confidence score. Predictions with a confidence score greater than the score threshold are output as predictions and used in the metric calculation; the default threshold is model specific and can be found on the hyperparameter tuning page (the box_score_threshold hyperparameter).

The metric computation for an image object detection or instance segmentation model is based on an overlap measurement defined by a metric called IoU (Intersection over Union), which is computed by dividing the area of overlap between the ground truth and the prediction by the area of union of the ground truth and the prediction. The IoU computed from every prediction is compared with an overlap threshold called the IoU threshold, which determines how much a prediction should overlap with a user-annotated ground truth in order to be considered a positive prediction. If the IoU computed from the prediction is less than the overlap threshold, the prediction isn't considered a positive prediction for the associated class.
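A minimal sketch of the IoU computation for two axis-aligned boxes (hypothetical coordinates, boxes given as (x1, y1, x2, y2)):

def box_iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))   # width of the overlap, 0 if disjoint
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))   # height of the overlap, 0 if disjoint
    intersection = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0 else 0.0

ground_truth = (0, 0, 10, 10)
prediction = (5, 5, 15, 15)
iou = box_iou(ground_truth, prediction)   # 25 / 175, roughly 0.14
print(iou, iou >= 0.5)                    # below a 0.5 IoU threshold, so not a positive prediction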

The primary metric for the evaluation of image object detection and instance segmentation models is the mean average precision (mAP). The mAP is the average value of the average precision (AP) across all the classes. Automated ML object detection models support the computation of mAP using the two popular methods below.

Pascal VOC metrics :

Pascal VOC mAP is the default way of computing mAP for object detection/instance segmentation models. The Pascal VOC style mAP method calculates the area under a version of the precision-recall curve. First, p(rᵢ), the precision at recall rᵢ, is computed for all unique recall values. p(rᵢ) is then replaced with the maximum precision obtained for any recall r' >= rᵢ, so the precision value is monotonically decreasing in this version of the curve. The Pascal VOC mAP metric is by default evaluated with an IoU threshold of 0.5. A detailed explanation of this concept is available in this blog.
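The interpolation step can be sketched in a few lines of NumPy (hypothetical precision-recall points, not the exact Pascal VOC implementation):

import numpy as np

# Hypothetical (recall, precision) points from a precision-recall curve, sorted by recall.
recalls = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
precisions = np.array([1.0, 0.9, 0.7, 0.75, 0.5, 0.4])

# Replace each precision with the maximum precision at any recall >= r,
# which makes the curve monotonically decreasing.
interpolated = np.maximum.accumulate(precisions[::-1])[::-1]

# Step-wise area under the interpolated curve approximates the average precision.
ap = np.sum(np.diff(recalls) * interpolated[1:])
print(ap)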

COCO metrics :

The COCO evaluation method uses a 101-point interpolated method for AP calculation, along with averaging over ten IoU thresholds. AP@[.5:.95] corresponds to the average AP for IoU from 0.5 to 0.95 with a step size of 0.05. Automated ML logs all 12 metrics defined by the COCO method, including the AP and AR (average recall) at various scales, in the application logs, while the metrics user interface shows only the mAP at an IoU threshold of 0.5.

Image object detection model evaluation can use COCO metrics if the validation_metric_type hyperparameter is set to 'coco', as explained in the hyperparameter tuning section.

Epoch-level metrics for object detection and instance segmentation

The mAP, precision, and recall values are logged at an epoch-level for image object detection/instance segmentation models. The mAP, precision, and recall metrics are also logged at a class level with the name 'per_label_metrics'. The 'per_label_metrics' should be viewed as a table.

Epoch-level metrics for precision, recall and per_label_metrics are not available when using the 'coco' method.

Epoch-level charts for object detection

Responsible AI dashboard for best recommended AutoML model (preview)

The Azure Machine Learning Responsible AI dashboard provides a single interface to help you implement Responsible AI in practice effectively and efficiently. The Responsible AI dashboard is supported only for tabular data and only for classification and regression models. It brings together several mature Responsible AI tools in the areas of:

  • Model performance and fairness assessment
  • Data exploration
  • Machine learning interpretability
  • Error analysis

While model evaluation metrics and charts are good for measuring the general quality of a model, operations such as inspecting the model's fairness, viewing its explanations (that is, which dataset features the model used to make its predictions), and inspecting its errors and potential blind spots are essential when practicing responsible AI. That's why automated ML provides a Responsible AI dashboard to help you observe various insights for your model. See how to view the Responsible AI dashboard in the Azure Machine Learning studio.

See how you can generate this dashboard via the UI or the SDK.

Model explanations and feature importances

While model evaluation metrics and charts are good for measuring the general quality of a model, inspecting which dataset features a model uses to make predictions is essential when practicing responsible AI. That's why automated ML provides a model explanations dashboard to measure and report the relative contributions of dataset features. See how to view the explanations dashboard in the Azure Machine Learning studio .

Interpretability (best model explanation) is not available for automated ML forecasting experiments that recommend the following algorithms as the best model or ensemble:

  • TCNForecaster
  • ExponentialSmoothing
  • Seasonal Average
  • Seasonal Naive


Steps to Build a Machine Learning Model

In today’s era of a data-rich environment where data generation volume, velocity, and variety are unparalleled, we face both opportunities and challenges. Machine learning models offer a powerful mechanism to extract meaningful patterns, trends, and insights from this vast pool of data, giving us the power to make better-informed decisions and appropriate actions. In this article, we will explore the Fundamentals of Machine Learning and the Steps to build a Machine Learning Model.


Table of Contents

  • Understanding the Fundamentals of Machine Learning
  • Comprehensive Guide to Building a Machine Learning Model
  • Step 1: Data Collection for Machine Learning
  • Step 2: Preprocessing and Preparing Your Data
  • Step 3: Selecting the Right Machine Learning Model
  • Step 4: Training Your Machine Learning Model
  • Step 5: Evaluating Model Performance
  • Step 6: Tuning and Optimizing Your Model
  • Step 7: Deploying the Model and Making Predictions

Understanding the Fundamentals of Machine Learning

Machine learning is the field of study that enables computers to learn from data and make decisions without explicit programming. Machine learning models play a pivotal role in tackling real-world problems across various domains by changing how we approach problems and decision-making. By using data-driven insights and sophisticated algorithms, machine learning models help us achieve unparalleled accuracy and efficiency in solving real-world problems.

Machine learning is crucial in today’s data-driven world, where the ability to extract insights and make predictions from vast amounts of data can drive significant advancement in any field; understanding its fundamentals is therefore essential.

We can see machine learning as a subset of artificial intelligence that focuses on developing algorithms capable of learning hidden patterns and relationships within data, allowing them to generalize and make better predictions or decisions on new data. To achieve this, we have several key concepts and techniques like supervised learning, unsupervised learning, and reinforcement learning.

  • Supervised learning involves training a model on labeled data, where the algorithm learns from the input data and its corresponding target (output labels). The goal is to map from input to output, allowing the model to learn the relationship and make predictions on new data based on what it has learned. Some of its algorithms are linear regression, logistic regression, decision trees, and more.
  • Unsupervised learning, on the other hand, deals with unlabeled datasets, where algorithms try to uncover hidden patterns or structures within the data. Unlike supervised learning, which depends on labeled data to create patterns or relationships for further predictions, unsupervised learning operates without such guidance. Some of its algorithms are clustering algorithms like k-means and hierarchical clustering, dimensionality reduction algorithms like PCA, and more.
  • Reinforcement learning is a part of machine learning that involves training an agent to interact with an environment and learn optimal actions through trial and error. It employs a reward-penalty strategy: the agent receives feedback in the form of rewards or penalties based on its actions, allowing it to learn from experience and maximize its reward over time. Reinforcement learning has applications in areas such as robotics, games, and more.

Some key ML terms to know before building a model are:

  • Feature : Features are the pieces of information that we use to train our model to make predictions. In simpler terms, they are the columns or attributes of the dataset that contain the data used for analysis and modeling.
  • Label : The output or target variable that the model aims to predict in supervised learning, also known as the dependent variable.
  • Training set : The portion of the dataset that is used to train the machine learning model. The model learns patterns and relationships in the data from the training set.
  • Validation set : A subset of the dataset that is used to tune the model’s hyperparameters and helps in assessing performance during training of the model.
  • Test Set : It is also a part of the dataset that is used to evaluate our final model performance on unseen data.

This comprehensive guide will take you through the process of building a machine learning model, covering everything from data preprocessing to model evaluation and deployment. By following these steps, you’ll learn how to create a robust machine learning model that meets your needs. Let’s see these steps.

Step 1: Data Collection for Machine Learning

Data collection is a crucial step in the creation of a machine learning model, as it lays the foundation for building accurate models. In this phase of machine learning model development, relevant data is gathered from various sources to train the machine learning model and enable it to make accurate predictions. The first step in data collection is defining the problem and understanding the requirements of the machine learning project. This usually involves determining the type of data we need for our project, like structured or unstructured data, and identifying potential sources for gathering data.

Once the requirements are finalized, data can be collected from a variety of sources such as databases, APIs, web scraping , and manual data entry. It is crucial to ensure that the collected data is both relevant and accurate, as the quality of the data directly impacts the generalization ability of our machine learning model. In other words, the better the quality of the data, the better the performance and reliability of our model in making predictions or decisions.

Step 2: Preprocessing and Preparing Your Data

Preprocessing and preparing data is an important step that involves transforming raw data into a format suitable for training and testing our models. This phase aims to clean the data (i.e., remove null and garbage values), normalize it, and preprocess it to achieve greater accuracy and performance from our machine learning models.

As Clive Humby said, “Data is the new oil. It’s valuable, but if unrefined it cannot be used.” This quote emphasizes the importance of refining data before using it for analysis or modeling. Just like oil needs to be refined to unlock its full potential, raw data must undergo preprocessing to enable its effective utilization in ML tasks. The preprocessing process typically involves several steps, including handling missing values, encoding categorical variables (i.e., converting them into numerical form), scaling numerical features, and feature engineering. This ensures that the model’s performance is optimized and that the model can generalize well to unseen data and ultimately make accurate predictions.
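A minimal sketch of such a preprocessing step with scikit-learn, using a tiny hypothetical dataset (column names and values are made up for illustration):

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical raw data with a missing value and a categorical column.
df = pd.DataFrame({
    "age": [25, 32, None, 41],
    "income": [40000, 52000, 61000, 58000],
    "city": ["Paris", "Berlin", "Paris", "Madrid"],
})

numeric = ["age", "income"]
categorical = ["city"]

preprocess = ColumnTransformer([
    # Fill missing numeric values with the median, then scale to zero mean and unit variance.
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric),
    # Convert categories into one-hot encoded columns.
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

X = preprocess.fit_transform(df)
print(X.shape)  # 4 rows: 2 scaled numeric columns + 3 one-hot columns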

Step 3: Selecting the Right Machine Learning Model

Selecting the right machine learning model plays a pivotal role in building a successful solution. With numerous algorithms and techniques readily available, choosing the most suitable model for a given problem significantly impacts the accuracy and performance of the result. The process of selecting the right machine learning model involves several considerations, some of which are:

Firstly, understanding the nature of the problem is an essential step: the task could be classification, regression, clustering, or something else, and different types of problems require different algorithms to build a predictive model.

Secondly, familiarizing yourself with a variety of machine learning algorithms suitable for your problem type is crucial. Evaluate the complexity of each algorithm and its interpretability; more complex models like deep learning may improve performance but are harder to interpret. The best approach is often to experiment with multiple models, evaluate their metrics, and iteratively check how well each algorithm generalizes to unseen data.

Step 4: Training Your Machine Learning Model

In this phase of building a machine learning model, we have all the necessary ingredients to train our model effectively. This involves using our prepared data to teach the model to recognize patterns and make predictions based on the input features. During the training process, we begin by feeding the preprocessed data into the selected machine learning algorithm. The algorithm then iteratively adjusts its internal parameters to minimize the difference between its predictions and the actual target values in the training data. This optimization process often employs techniques like gradient descent.

As the model learns from the training data, it gradually improves its ability to generalize to new or unseen data. This iterative learning process enables the model to become more adept at making accurate predictions across a wide range of scenarios.

Step 5: Evaluating Model Performance

Once you have trained your model, it’s time to assess its performance. There are various metrics used to evaluate model performance, categorized based on the type of task: regression/numerical or classification.

1. For regression tasks, common evaluation metrics are:

  • Mean Absolute Error (MAE): MAE is the average of the absolute differences between predicted and actual values.
  • Mean Squared Error (MSE): MSE is the average of the squared differences between predicted and actual values.
  • Root Mean Squared Error (RMSE): The square root of the MSE, providing a measure of the average magnitude of error.
  • R-squared (R2): It is the proportion of the variance in the dependent variable that is predictable from the independent variables.

2. For classification tasks, common evaluation metrics are:

  • Accuracy: Proportion of correctly classified instances out of the total instances.
  • Precision: Proportion of true positive predictions among all positive predictions.
  • Recall: Proportion of true positive predictions among all actual positive instances.
  • F1-score: Harmonic mean of precision and recall, providing a balanced measure of model performance.
  • Area Under the Receiver Operating Characteristic curve (AUC-ROC): Measure of the model’s ability to distinguish between classes.
  • Confusion Matrix: A matrix that summarizes the performance of a classification model, showing counts of true positive, true negative, false positive, and false negative instances.

By evaluating the model using these metrics, you can gain insight into its strengths and weaknesses, guiding further refinement and optimization.

Step 6: Tuning and Optimizing Your Model

Having trained our model, the next step is to optimize it further. Tuning and optimization help the model maximize its performance and generalization ability. This process involves fine-tuning hyperparameters, selecting the best algorithm, and improving features through feature engineering techniques. Hyperparameters are parameters that are set before the training process begins and control the behavior of the machine learning model; settings like the learning rate and regularization strength should be carefully adjusted.

Techniques like grid search, randomized search, and cross-validation are used to systematically explore the hyperparameter space and identify the best combination of hyperparameters for the model. Overall, tuning and optimizing the model involves a combination of careful selection of hyperparameters, feature engineering, and other techniques to create a model that generalizes well.
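A minimal sketch of hyperparameter search with scikit-learn's GridSearchCV, on a synthetic dataset (the model, grid, and scoring choice here are illustrative, not a recommendation):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data stands in for your prepared training set.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Search a small hyperparameter grid with 5-fold cross-validation.
param_grid = {"n_estimators": [100, 300], "max_depth": [None, 5, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5, scoring="f1")
search.fit(X, y)

print(search.best_params_, search.best_score_)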

Step 7: Deploying the Model and Making Predictions

Deploying the model and making predictions is the final stage in the journey of creating an ML model. Once a model has been trained and optimized, it’s time to integrate it into a production environment where it can provide real-time predictions on new data.

During model deployment, it’s essential to ensure that the system can handle high user loads, operate smoothly without crashes, and be easily updated. Tools like Docker and Kubernetes help make this process easier by packaging the model in a way that makes it easy to run on different computers and manage efficiently. Once deployment is done, our model is ready to predict on new data, which involves feeding unseen data into the deployed model to enable real-time decision-making.

In conclusion, building a machine learning model involves collecting and preparing data, selecting the right algorithm, tuning it, evaluating its performance, and deploying it for real-time decision-making. Through these steps, we can refine the model to make accurate predictions and contribute to solving real-world problems.




Open Source ML Model Versioning, Metadata, and Experiment Management

VertaAI/modeldb


ModelDB: an open-source system for machine learning model versioning, metadata, and experiment management.


Quickstart · Workflow · Examples · Contribute · Support (Slack)

ModelDB is an open-source system to version machine learning models, including their ingredients (code, data, config, and environment), and to track ML metadata across the model lifecycle.

Use ModelDB in order to:

  • Make your ML models reproducible
  • Manage your ML experiments, build performance dashboards, and share reports
  • Track models across their lifecycle including development, deployment, and live monitoring
  • Works on Docker, Kubernetes
  • Clients in Python and Scala
  • Beautiful dashboards for model performance and reporting
  • Git-like operations on any model
  • Flexible metadata logging including metrics, artifacts, tags and user information
  • Pluggable storage systems
  • Integration into state-of-the-art frameworks like Tensorflow and PyTorch
  • Battle-tested in production environments
  • If you are looking for a hosted version of ModelDB, please reach out at [email protected] .

This version of ModelDB is built upon its predecessor from CSAIL, MIT . The previous version can be found on Github here . The ModelDB project is now maintained by Verta.ai .

What’s In This Document

  • Up and Running in 5 Minutes
  • Documentation
  • Architecture

  • How to Contribute
  • Thanks to Our Contributors

Install Docker (and Docker Compose)

Setup ModelDB via Docker Compose

Note: modeldb-backend service needs backend/config/config.yaml to run, either clone the repo before running docker-compose or create the file manually.

  • Install the ModelDB pip package. Note it comes packaged in the verta package.
  • Version a model or log a workflow. Alternatively, run any of the detailed examples in our repository.

That's it! Navigate to http://localhost:3000 to find the ModelDB Web UI and check out the models you just logged.

For information on debugging the Docker-based ModelDB installation, check here .

Other ways to install ModelDB are:

  • Building the source code and deploying
  • Deploy on kubernetes via helm
  • Using a ModelDB ami

Official documentation for ModelDB can be found here .

For Getting Started guides, Tutorials, and API reference check out our docs .

To report a bug, file a documentation issue, or submit a feature request, please open a GitHub issue.

For help, questions, contribution discussions and release announcements, please join us on Slack .

At a high level the architecture of ModelDB in a Kubernetes cluster or a Docker application looks as below:

image

  • The ModelDB Client, available in Python and Scala, can be instantiated in the user's model-building code and exposes functions to store information to ModelDB.
  • It receives the requests from the client (1) and the browser and routes them to the appropriate container.
  • The gRPC calls (2) for creating, reading, updating, or deleting Projects, Experiments, ExperimentRuns, Datasets, DatasetVersions, or their metadata are routed to the ModelDB Proxy.
  • The HTTP calls (3) for storing and retrieving binary artifacts are forwarded directly to the backend.
  • It receives the gRPC requests from the frontend (2) and sends them to the backend (4). In the other direction, it converts the responses from the backend and sends them to the frontend.
  • It exposes gRPC endpoints (4) for most of the operations, which are used by the proxy.
  • It has HTTP endpoints (3) for storing, retrieving, and deleting artifacts, used directly by the frontend.
  • Out of the box, ModelDB is configured and verified to work against PostgreSQL, but since it uses Hibernate as an ORM and Liquibase for change management, it should be easy to configure ModelDB to run on another SQL database supported by these tools.

Volumes : The relational database and the artifact store in backend need volumes attached to enable persistent storage.

Repo Structure

Each module in the architecture diagram has a designated folder in this repository, and each has its own README covering in-depth documentation and contribution guidelines.

  • protos has the protobuf definitions of the objects and endpoints used across ModelDB. More details here .
  • backend has the source code and tests for ModelDB Backend. It also holds the proxy at backend/proxy . More details here .
  • client has the source code and tests for ModelDB client. More details here .
  • webapp has the source and tests for ModelDB frontend. More details here .

Other supporting material for deployment and documentation is at:

  • chart has the helm chart to deploy ModelDB onto your Kubernetes cluster. More details here .
  • doc-resources has images for documentation.

Contributions

As seen from the architecture, ModelDB provides a full-stack solution for tracking, versioning, and auditing machine learning models. We are open to contributions to any of the modules in the form of pull requests.

The main skill sets for each module are as below:

  • backend: If you are interested in Java development or are interested in database design using technologies like Hibernate and Liquibase please take a look at backend README for setup and development instructions.
  • client: If you are interested in Python or Scala development or are interested in building examples notebooks on various ML frameworks logging data to Modeldb please take a look at client README .
  • frontend: If you are interested in Node , React or Redux based development please take a look at webapp README

Please reach out to us in slack for any assistance in getting started with the development set up or any other feedback.

ModelDB is licensed under Apache 2.0.

Thanks to our many contributors and users.


How to put machine learning models into production


Data scientists excel at creating models that represent and predict real-world data, but effectively deploying machine learning models is more of an art than a science. Deployment requires skills more commonly found in software engineering and DevOps. VentureBeat reports that 87% of data science projects never make it to production, while Redapt claims it is 90%. Both highlight that a critical factor that makes the difference between success and failure is the ability to collaborate and iterate as a team.

The goal of building a machine learning model is to solve a problem, and a machine learning model can only do so when it is in production and actively in use by consumers. As such, model deployment is as important as model building. As Redapt points out, there can be a “disconnect between IT and data science. IT tends to stay focused on making things available and stable. They want uptime at all costs. Data scientists, on the other hand, are focused on iteration and experimentation. They want to break things.” Bridging the gap between those two worlds is key to ensuring you have a good model and can actually put it into production.

Most data scientists feel that model deployment is a software engineering task and should be handled by software engineers because the required skills are more closely aligned with their day-to-day work. While this is somewhat true, data scientists who learn these skills will have an advantage, especially in lean organizations. Tools like TFX, MLflow, and Kubeflow can simplify the whole process of model deployment, and data scientists can (and should) quickly learn and use them.

The difficulties in model deployment and management have given rise to a new, specialized role: the machine learning engineer. Machine learning engineers are closer to software engineers than typical data scientists, and as such, they are the ideal candidate to put models into production. But not every company has the luxury of hiring specialized engineers just to deploy models. For today’s lean engineering shop, it is advisable that data scientists learn how to get their models into production.

In all this, another question looms — what is the most effective way to put machine learning models into production?

This question is critical, because machine learning promises lots of potential for businesses, and any company that can quickly and effectively get their models to production can outshine their competitors.

In this article, I’m going to talk about some of the practices and methods that will help get machine learning models in production. I’ll discuss different techniques and use cases, as well as the pros and cons of each method.

So without wasting any more time, let’s get to it!

From model to production

Many teams embark on machine learning projects without a production plan, an approach that often leads to serious problems when it's time to deploy. It is both expensive and time-consuming to create models, and you should not invest in an ML project if you have no plan to put it in production, except of course when doing pure research. With a plan in hand, you won’t be surprised by any pitfalls that could derail your launch.

There are three key areas your team needs to consider before embarking on any ML projects are:

  • Data storage and retrieval
  • Frameworks and tooling
  • Feedback and iteration

A machine learning model is of no use to anyone if it doesn’t have any data associated with it. You’ll likely have training, evaluation, testing, and even prediction data sets. You need to answer questions like:

  • How is your training data stored?
  • How large is your data?
  • How will you retrieve the data for training?
  • How will you retrieve data for prediction?

These questions are important as they will guide you on what frameworks or tools to use, how to approach your problem, and how to design your ML model. Before you do anything else in a machine learning project, think about these data questions.

Data can be stored on-premises, in cloud storage, or in a hybrid of the two. It makes sense to store your data where the model training will occur and the results will be served: on-premises training and serving is best suited for on-premises data, especially if the data is large, while data stored in cloud storage systems like GCS, AWS S3, or Azure Storage should be matched with cloud ML training and serving.

The size of your data also matters a lot. If your dataset is large, you need more computing power for preprocessing steps as well as for model optimization. This means you either have to plan for more compute if you’re operating locally or set up auto-scaling in a cloud environment from the start. Either option can get expensive if you haven’t thought through your data needs, so plan ahead to make sure your budget can support the model through both training and production.

Even if you have your training data stored together with the model to be trained, you still need to consider how that data will be retrieved and processed. Here the question of batch vs. real-time data retrieval comes to mind, and this has to be considered before designing the ML system. Batch data retrieval means that data is retrieved in chunks from a storage system while real-time data retrieval means that data is retrieved as soon as it is available.
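
To make the distinction concrete, here is a minimal Python sketch of the two retrieval patterns. It is illustrative only: the file path is a placeholder, and `event_stream` stands in for whatever feed (a message-queue consumer, a webhook handler) your system actually exposes.

```python
import pandas as pd

# Batch retrieval: pull training data from storage in fixed-size chunks.
# "training_data.csv" is a placeholder path, not a file from this article.
def iter_training_batches(path="training_data.csv", chunk_size=10_000):
    for chunk in pd.read_csv(path, chunksize=chunk_size):
        yield chunk

# Real-time retrieval: handle each record the moment it becomes available.
# `event_stream` is any iterable feed, e.g. a message-queue consumer.
def consume_stream(event_stream, handle_record):
    for record in event_stream:
        handle_record(record)
```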

Along with training data retrieval, you will also need to think about prediction data retrieval. Your prediction data is rarely as neatly packaged as the training data, so you need to consider a few more issues related to how your model will receive data at inference time:

  • Are you getting inference data from webpages?
  • Are you receiving prediction requests from APIs?
  • Are you making batch or real-time predictions?

If you’re getting data from webpages, the question then is: what type of data? Data from users on webpages could be structured (CSVs, JSON) or unstructured (images, videos, sound), and the inference engine should be robust enough to retrieve it, process it, and make predictions. Inference data from web pages may be very sensitive for users, so you must take things like privacy and ethics into consideration. Here, frameworks like Federated Learning, where the model is brought to the data and the data never leaves webpages/users, can be considered.

Another issue here has to do with data quality. Data used for inference will often be very different from training data, especially when it comes directly from end users rather than APIs. Therefore, you must provide the necessary infrastructure to fully automate the detection of changes in this new data as well as its processing.

As with retrieval, you need to consider whether inference is done in batches or in real-time. These two scenarios require different approaches, as the technology and skills involved may be different. For batch inference, you might want to save prediction requests to a central store and then make inferences after a designated period, while in real-time, prediction is performed as soon as the inference request is made. Knowing this will enable you to effectively plan when and how to schedule compute resources, as well as what tools to use.
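
As a rough illustration (not a prescription), the real-time path often looks like a small web service, while the batch path looks like a scheduled job over a store of accumulated requests. Flask, joblib, the model file name, and the `request_store` interface below are all assumptions made for this sketch.

```python
from flask import Flask, request, jsonify
import joblib

app = Flask(__name__)
model = joblib.load("model.joblib")  # placeholder artifact produced by your training job

# Real-time inference: score each request as soon as it arrives.
@app.route("/predict", methods=["POST"])
def predict_realtime():
    payload = request.get_json()
    prediction = model.predict([payload["features"]])[0]
    return jsonify({"prediction": float(prediction)})

# Batch inference: run on a schedule and score everything accumulated so far.
def predict_batch(request_store):
    pending = request_store.load_pending()            # hypothetical central store API
    request_store.save_predictions(model.predict(pending))
```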

Raising and answering questions relating to data storage and retrieval is important and will get you thinking about the right way to design your ML project.

Your model isn’t going to train, run, and deploy itself. For that, you need frameworks and tooling, software and hardware that help you effectively deploy ML models. These can be frameworks like TensorFlow, PyTorch, and scikit-learn for training models, programming languages like Python, Java, and Go, and even cloud environments like AWS, GCP, and Azure.

After examining and preparing your use of data, the next line of thinking should consider what combination of frameworks and tools to use.

The choice of framework is very important, as it can decide the continuity, maintenance, and use of a model. In this step, you must answer the following questions:

  • What is the best tool for the task at hand?
  • Is the tool open-source or closed-source?
  • How many platforms/targets does the tool support?

To help determine the best tool for the task, you should research and compare findings for different tools that perform the same job. For instance, you can compare these tools based on criteria like:

Efficiency: How efficient is the framework or tool in production? A framework or tool is efficient if it optimally uses resources like memory, CPU, or time. It is important to consider the efficiency of the frameworks or tools you intend to use because they have a direct effect on project performance, reliability, and stability.

Popularity: How popular is the tool in the developer community? Popularity often means it works well, is actively in use, and has a lot of support. It is also worth mentioning that there may be newer tools that are less popular but more efficient than popular ones, especially among closed-source, proprietary tools. You’ll need to weigh that when picking a proprietary tool to use. Generally, in open-source projects, you’d lean toward popular and more mature tools for reasons I’ll discuss below.

Support: How good is the support for the framework or tool? Does it have a vibrant community behind it if it is open-source, or good vendor support if it is closed-source? How fast can you find tips, tricks, tutorials, and other use cases in actual projects?

Next, you also need to know whether the tool or framework you have selected is open-source or not. There are pros and cons to this, and the answer will depend on things like budget, support, continuity, community, and so on. Sometimes, you can get a proprietary build of open-source software, which means you get the benefits of open source plus premium support.

One more question you need to answer is how many platforms/targets does your choice of framework support? That is, does your choice of framework support popular platforms like the web or mobile environments? Does it run on Windows, Linux, or Mac OS? Is it easy to customize or implement in this target environment? These questions are important as there can be many tools available to research and experiment on a project, but few tools that adequately support your model while in production.

ML projects are never static; feedback and iteration are part of the engineering and design that must be considered from the start. Here you should answer questions like:

  • How do you get feedback from a model in production?
  • How do you set up continuous delivery?

Getting feedback from a model in production is very important. Actively tracking and monitoring model state can warn you in cases of model performance degradation or decay, bias creep, or even data skew and drift. This will ensure that such problems are quickly addressed before the end-user notices.
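
One lightweight way to watch for data drift is sketched below using SciPy’s two-sample Kolmogorov-Smirnov test; the significance level and window sizes are illustrative, not recommendations.

```python
import numpy as np
from scipy.stats import ks_2samp

def check_feature_drift(train_values, live_values, alpha=0.01):
    """Flag drift when a feature's live distribution no longer matches training data."""
    statistic, p_value = ks_2samp(train_values, live_values)
    return {"ks_statistic": statistic, "p_value": p_value, "drift": p_value < alpha}

# Toy usage: compare a feature's training distribution against a recent production window.
rng = np.random.default_rng(0)
report = check_feature_drift(rng.normal(0.0, 1.0, 5_000), rng.normal(0.4, 1.0, 5_000))
```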

Consider how to experiment on, retrain, and deploy new models in production without bringing the current model down or otherwise interrupting its operation. A new model should be properly tested before it replaces the old one. This idea of continuously testing and deploying new models without interrupting existing model processes is the core of continuous integration and delivery (CI/CD) for ML.
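
A deployment gate can be as simple as the hedged sketch below: evaluate the candidate and the production model on the same holdout data and only promote the candidate if it clears both a relative and an absolute bar (the thresholds are placeholders, not recommendations).

```python
def should_promote(candidate_metric, production_metric,
                   min_absolute_gain=0.0, min_value=0.7):
    """Promote only if the candidate beats production and clears a quality floor."""
    return (candidate_metric >= min_value
            and candidate_metric >= production_metric + min_absolute_gain)

# Example: candidate AUC 0.83 vs. production AUC 0.81 on the same holdout set.
promote = should_promote(candidate_metric=0.83, production_metric=0.81)
```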

There are many other issues to consider when getting a model into production, and this article is not law, but I’m confident that most of the questions you’ll ask fall under one of the categories stated above.

An example of machine learning deployment

Now, I’m going to walk you through a sample ML project. In this project, you’re an ML engineer working on a promising project, and you want to design a fail-proof system that can effectively deploy, monitor, and track an ML model.

Consider Adstocrat, an advertising agency that provides online companies with efficient ad tracking and monitoring. They have worked with big companies and have recently gotten a contract to build a machine learning system to predict whether customers will click on an ad shown on a webpage. The contractors have a large dataset in a Google Cloud Storage (GCS) bucket and want Adstocrat to develop an end-to-end ML system for them.

As the engineer in charge, you have to come up with a design solution before the project kicks off. To approach this problem, work through each of the questions raised earlier and develop a design for this end-to-end system.

Data concerns

First, let’s talk about the data. How is your training data stored?

The data is stored in a GCS bucket and comes in two forms. The first is a CSV file describing the ad, and the second is the corresponding image of the ad. The data is already in the cloud, so it may be better to build your ML system in the cloud. You’ll get better latency for I/O, easy scaling as data becomes larger (hundreds of gigabytes), and quick setup and configuration for any additional GPUs and TPUs.

The contractor serves millions of ads every month, and the data is aggregated and stored in the cloud bucket at the end of every month. Now you know your data is large (hundreds of gigabytes of images), which strengthens your hunch to build the system in the cloud.

Since data is stored in the GCS bucket, it can be easily retrieved and consumed by models built on the Google Cloud Platform. So now you have an idea of which cloud provider to use.

In terms of inference data, the contractors informed you that inference will be requested by their internal API; that is, prediction requests will come in through a REST API. This gives you an idea of the target platform for the project.
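
Given that setup, pulling the ad metadata and images out of the GCS bucket could look roughly like the sketch below, using the official google-cloud-storage client. The bucket name, object paths, and column name are made-up placeholders for the Adstocrat scenario.

```python
import io
import pandas as pd
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("adstocrat-ads-data")        # placeholder bucket name

# Load the CSV file describing the ads.
csv_bytes = bucket.blob("ads/metadata.csv").download_as_bytes()
ads_df = pd.read_csv(io.BytesIO(csv_bytes))

# Download the image that corresponds to the first ad row.
image_blob = bucket.blob(f"ads/images/{ads_df.loc[0, 'image_id']}.png")
image_blob.download_to_filename("/tmp/ad_0.png")
```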

Frameworks and tools for the project

There are many combinations of tools you can use at this stage, and the choice of one tool may affect the others. In terms of programming languages for prototyping, model building, and deployment, you can decide to choose the same language for these three stages or use different ones according to your research findings. For instance, Java is a very efficient language for backend programming, but cannot be compared to a versatile language like Python when it comes to machine learning.

After consideration, you decide to use Python as your programming language, TensorFlow for model building because you will be working with a large dataset that includes images, and TensorFlow Extended (TFX), an open-source tool released by Google and used internally there, for building your pipelines. What about the other aspects of model building like model analysis, monitoring, serving, and so on? What tools do you use here? Well, TFX pretty much covers it all!

TFX provides a bunch of frameworks, libraries, and components for defining, launching, and monitoring machine learning models in production. The components available in TFX let you build efficient ML pipelines specifically designed to scale from the start. These components have built-in support for ML modeling, training, serving, and even managing deployments to different targets.
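
To give a feel for what that looks like in practice, here is a minimal pipeline skeleton loosely following the pattern of the official TFX starter tutorials. The paths, pipeline name, and training module are placeholders, and component arguments can differ between TFX versions, so treat this as a sketch rather than copy-paste code.

```python
from tfx import v1 as tfx

# Placeholder locations for this sketch (not taken from the article).
DATA_ROOT = "gs://adstocrat-ads-data/ads"
PIPELINE_ROOT = "gs://adstocrat-ml/pipelines/ctr"
SERVING_MODEL_DIR = "gs://adstocrat-ml/serving/ctr"
METADATA_PATH = "metadata/ctr/metadata.db"

# Ingest CSV examples, train with a user-provided module, and push the result for serving.
example_gen = tfx.components.CsvExampleGen(input_base=DATA_ROOT)

trainer = tfx.components.Trainer(
    module_file="ctr_trainer.py",              # your model-building code lives here
    examples=example_gen.outputs["examples"],
    train_args=tfx.proto.TrainArgs(num_steps=1000),
    eval_args=tfx.proto.EvalArgs(num_steps=100),
)

pusher = tfx.components.Pusher(
    model=trainer.outputs["model"],
    push_destination=tfx.proto.PushDestination(
        filesystem=tfx.proto.PushDestination.Filesystem(
            base_directory=SERVING_MODEL_DIR)),
)

pipeline = tfx.dsl.Pipeline(
    pipeline_name="adstocrat-ctr",
    pipeline_root=PIPELINE_ROOT,
    components=[example_gen, trainer, pusher],
    metadata_connection_config=tfx.orchestration.metadata.sqlite_metadata_connection_config(
        METADATA_PATH),
)

# Run locally for development; on GCP the same pipeline is handed to a managed orchestrator.
tfx.orchestration.LocalDagRunner().run(pipeline)
```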


TFX is also compatible with your choice of programming language (Python) and your choice of deep learning framework (TensorFlow), and this will encourage consistency across your team. Also, since TFX and TensorFlow were built by Google, they have first-class support on the Google Cloud Platform. And remember, your data is stored in GCS.

If you want the technical details on how to build a complete end-to-end pipeline with TFX, see the links below:

  • TensorFlow Extended (TFX) | ML Production Pipelines: “Build and manage end-to-end production ML pipelines. TFX components enable scalable, high-performance data processing…” (www.tensorflow.org)
  • The TensorFlow Blog: “Creating Sounds Of India: An on device, AI powered, musical experience built with TensorFlow” (blog.tensorflow.org)

Python, TFX, and TensorFlow are all open-source, and they are the major tools for building your system. In terms of computing power and storage, you are all-in on GCP, which is a paid, managed cloud service. This has its pros and cons and may depend on your use case as well. Some of the pros of using managed cloud services are:

  • Cost efficiency
  • Quick setup and deployment
  • Efficient backup and recovery

Some of the cons are:

  • Security concerns, especially for sensitive data
  • Internet connectivity may affect work since everything runs online
  • Recurring costs
  • Limited control over tools

In general, for smaller businesses like startups, it is usually cheaper and better to use managed cloud services for your projects.

TFX and TensorFlow run anywhere Python runs, and that’s a lot of places. Also, models built with TensorFlow can easily be saved and served in browsers using TensorFlow.js, on mobile and IoT devices using TensorFlow Lite, in the cloud, and even on-prem.
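
As a quick illustration of that portability, converting a SavedModel for mobile or IoT with TensorFlow Lite looks roughly like this (the paths are placeholders):

```python
import tensorflow as tf

# "serving/ctr/1" is a placeholder SavedModel directory.
converter = tf.lite.TFLiteConverter.from_saved_model("serving/ctr/1")
tflite_model = converter.convert()

with open("ctr_model.tflite", "wb") as f:
    f.write(tflite_model)
```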

Feedback and Iteration concerns

TFX supports a feedback mechanism that can be easily used to manage model versioning as well as rolling out new models. Custom feedback can be built around this tooling to effectively track models in production. A TFX component called TensorFlow Model Analysis (TFMA) allows you to easily evaluate new models against current ones before deployment.
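
Continuing the pipeline sketch from earlier, the snippet below shows how that evaluation step is typically wired up: a Resolver fetches the latest “blessed” production model as a baseline, and an Evaluator blesses the new candidate only if it clears the configured thresholds. The label key, metric, and threshold values are placeholders, and exact argument names can vary across TFX/TFMA versions.

```python
import tensorflow_model_analysis as tfma
from tfx import v1 as tfx

# Resolve the latest model already blessed in production to serve as the baseline.
model_resolver = tfx.dsl.Resolver(
    strategy_class=tfx.dsl.experimental.LatestBlessedModelStrategy,
    model=tfx.dsl.Channel(type=tfx.types.standard_artifacts.Model),
    model_blessing=tfx.dsl.Channel(type=tfx.types.standard_artifacts.ModelBlessing),
).with_id("latest_blessed_model_resolver")

# "clicked" is a placeholder label key; the 0.6 accuracy floor is illustrative only.
eval_config = tfma.EvalConfig(
    model_specs=[tfma.ModelSpec(label_key="clicked")],
    slicing_specs=[tfma.SlicingSpec()],
    metrics_specs=[tfma.MetricsSpec(metrics=[
        tfma.MetricConfig(
            class_name="BinaryAccuracy",
            threshold=tfma.MetricThreshold(
                value_threshold=tfma.GenericValueThreshold(lower_bound={"value": 0.6}),
                change_threshold=tfma.GenericChangeThreshold(
                    direction=tfma.MetricDirection.HIGHER_IS_BETTER,
                    absolute={"value": -1e-10}))),
    ])],
)

# The Evaluator only blesses the candidate if it beats the baseline under eval_config.
# `example_gen` and `trainer` refer to the components defined in the earlier sketch.
evaluator = tfx.components.Evaluator(
    examples=example_gen.outputs["examples"],
    model=trainer.outputs["model"],
    baseline_model=model_resolver.outputs["model"],
    eval_config=eval_config,
)
```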

Looking back at the answers above, you can already begin to picture what your final ML system design will look like. Getting this part right before model building or data exploration begins is very important.

Effectively putting an ML model into production does not have to be hard if all the boxes are ticked before embarking on the project. This is critical in any ML project you embark on and should be prioritized!

While this post is not exhaustive, I hope it has provided you with a guide and intuition on how to approach an ML project to put it in production.

Thanks for reading! See you again another time.


Harnessing Machine Learning to Guide Scientific Understanding

Sam Dillavou, Department of Physics and Astronomy, University of Pennsylvania, Philadelphia, PA, US

[Figure 1, not shown: (a) observed phenomena; (b) a controlled experiment combined with known physical laws; (c) a generic ML model as a black box of many fitted parameters; (d) the researchers’ interpretable three-stage model.]

Physical theories and machine-learning (ML) models are both judged on their ability to predict results in unseen scenarios. However, the bar for the former is much higher. To become accepted knowledge, a theory must conform to known physical laws and—crucially—be interpretable. An interpretable theory is capable of explaining why phenomena occur rather than simply predicting their form. Having such an interpretation can inform the scope of a new theory, allowing it to be applied in new contexts, while also connecting it to and incorporating prior knowledge. To date, researchers have largely struggled to get ML models (or any automated optimization process) to produce new theories that meet these standards. Jonathan Colen and Vincenzo Vitelli of the University of Chicago and their colleagues now show success at harnessing ML not as a stand-in for a researcher but rather as a guide to aid in building a model of a complex system [1]. In a demonstration of their method, the researchers have identified a previously overlooked term that leads to a more complete understanding of dynamics in a fluidic system.

To build new models, physicists often observe phenomena (Fig. 1a) in a controlled experiment (Fig. 1b) and attempt to relate parameters of the system to each other with equations. Then, through a combination of intuition and trial and error, they modify the experiment, the theory, or both until they find a set of equations that describes the data. Prior knowledge—for instance, that the system should have no history dependence, that temperature is uniform, or that gravity can be ignored—vastly shrinks the space of possible solutions and of required experimental exploration. This severe narrowing of scope is usually necessary for us humans, as we find it extremely difficult to grapple with a problem in more than a handful of dimensions.

In contrast, ML models find more accurate and more generalizable solutions when given a (very) high-dimensional space to explore [2]. These models optimize enormous numbers of adjustable parameters until their predictions match the data. Unfortunately, the solutions found by generic ML models are often far too complicated and method dependent to extract a “why” [3]. Researchers applying such methods are therefore often limited to the unsatisfying claim that their data contains predictive information [4]. But what that information is and why it is predictive remains hidden in a black box of many messy variables (Fig. 1c). Techniques to identify where in the data that predictive information resides are emerging [5], but they are rarely used in the scientific process. An alternative to complex ML models is to use algorithms that search libraries of possible equations to describe a system directly [6]. However, this tactic scales poorly with system complexity, making it difficult to use on phenomena of modern interest. To incorporate ML into the general discovery process requires a balance: The method should have sufficient free rein to unlock its potential but also a restricted terrain on which the results will be interpretable.

Colen, Vitelli, and their colleagues now do just that using a sequence of ML algorithms [1]. Their work focuses on a paradigmatic problem in hydrodynamics: a single-file queue of water droplets in a microfluidic channel, suspended in a second fluid that causes them to interact and form a propagating shock front. This system has been previously modeled by a partial differential equation that describes the changing fluid density. But the equation, called Burgers’ equation, fails to capture key aspects of the system’s dynamics. To uncover the missing physics, the researchers first train an ML model to predict the time evolution of the 1D droplet density field ρ; in other words, they task their algorithm with finding a function M that maps the initial density ρ₀ forward in time: M[t, ρ₀] = ρ(t).

To make their model interpretable, the researchers construct it from three successive operations. First, a neural network N transforms the density into a new 1D field, which they call φ₀ = N[ρ₀]. While this “latent” field does not have an easily interpretable physical meaning, it only contains information about the initial density field. Second, this field is fed into a function, called F, that steps it forward in time; in other words, F(φ₀, t) = φ(t). The researchers restrict the form of F to a set of linear operations. Finally, the field is transformed back into density by another neural network, essentially an inverse of the first step. (Mathematically, the entire process can be described as M[t, ρ₀] = N⁻¹[F(t, N[ρ₀])] = ρ(t) and is drawn schematically in Fig. 1d, top.) By simultaneously optimizing all three steps to match experimental data, the researchers found better predictions than those made by Burgers’ equation.
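
For readers who think in code, here is a toy numerical sketch of that three-stage structure (an encoder N, a linear latent propagator F, and an approximate inverse decoder). It is emphatically not the authors’ implementation; the dimensions, random weights, and tanh nonlinearity are arbitrary stand-ins chosen only to show the shape of the map M.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
n, d = 64, 16                                   # grid points in rho, latent size (arbitrary)

W_enc = rng.normal(size=(d, n)) / np.sqrt(n)    # stand-in for the learned encoder network N
W_dec = np.linalg.pinv(W_enc)                   # crude stand-in for the learned inverse of N
A = 0.01 * rng.normal(size=(d, d))              # generator of the linear latent dynamics F

def encode(rho0):                               # phi_0 = N[rho_0]
    return np.tanh(W_enc @ rho0)

def propagate(phi0, t):                         # phi(t) = F(phi_0, t), restricted to linear ops
    return expm(A * t) @ phi0

def decode(phi_t):                              # rho(t) from phi(t)
    return W_dec @ np.arctanh(np.clip(phi_t, -0.999, 0.999))

def M(t, rho0):                                 # full map M[t, rho_0] = rho(t)
    return decode(propagate(encode(rho0), t))

rho0 = np.exp(-np.linspace(-3, 3, n) ** 2)      # a toy initial density bump
rho_half = M(0.5, rho0)                         # the field propagated to t = 0.5
```

In the actual study, all three pieces are trained jointly on the droplet data, and the linear propagator is then distilled into a five-term partial differential equation, as described next.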

The researchers then utilized an algorithm that finds simplified analytical approximations of numerical functions [6]. This step would fail for a typical neural network trained on the experimental data (Fig. 1c). But, notably, it produces a five-term linear partial differential equation as a good stand-in for F. Despite this equation operating on the (uninterpretable) latent variable φ₀, F’s role as a time propagator makes each term’s meaning intelligible at a high level. Specifically, the researchers identify one of the differential terms as connected to dispersion, a frequency dependence in the wave speed of the fluid. Such a dispersive term is not present in Burgers’ equation, but the team found that its addition produces more accurate descriptions of the shock-front dynamics that arise in the droplet density field. Finally, the team develops a model of interacting droplets and finds that this added dispersive term is a direct consequence of nonreciprocal hydrodynamic interactions.

This work provides an exciting use of ML as a compass during scientific exploration, which requires a fundamentally different approach than standard ML practice, where models are judged primarily by their prediction accuracy. However, for scientific exploration the “best” models are the ones that lead to physical insight (the “why”) but may not be the most accurate. In fact, the team found that adding the key dispersive term actually raised the predictive error slightly compared to other ML models applied to the same problem; however, it clearly captured missing physics occurring near the shock front. Rather than lower error, it was closing the loop with a continuum model and identifying the source of this dispersive term that allowed Colen, Vitelli, and colleagues to solidify their conclusions. This workflow dovetails with recent work here at the University of Pennsylvania using ML as an experimental guide [7], wherein the simplest and “weakest” (least-predictive) models trained to predict clogging in granular materials gave the most insight, prompting experiments that solidified their interpretation.

Increases in computational power have massively accelerated analysis of scientific data, yet our exploration of that data often remains entirely human driven. As physicists study increasingly complex emergent phenomena, the dimensions of potential physical models and therefore the complexity of required experimental exploration grow rapidly. While standard analysis tools allow us to identify robust trends, it may not be feasible to hunt down highly nonlinear, history-dependent, and multiscale effects in (necessarily) messy data without a guide capable of ingesting 100 dimensions at once. To study such phenomena, fluency with both the subject matter and ML tools may prove an invaluable combination, both as an experimental guide and a theoretical one.

  1. J. Colen et al., “Interpreting neural operators: How nonlinear waves propagate in nonreciprocal solids,” Phys. Rev. Lett. 133, 107301 (2024).
  2. J. W. Rocks and P. Mehta, “Memorizing without overfitting: Bias, variance, and interpolation in overparameterized models,” Phys. Rev. Res. 4, 013201 (2022).
  3. C. Rudin et al., “Interpretable machine learning: Fundamental principles and 10 grand challenges,” Statist. Surv. 16, 1 (2022).
  4. S. Dillavou et al., “Beyond quality and quantity: Spatial distribution of contact encodes frictional strength,” Phys. Rev. E 106, 033001 (2022).
  5. K. A. Murphy and D. S. Bassett, “Information decomposition in complex systems via machine learning,” Proc. Natl. Acad. Sci. U.S.A. 121, 13 (2024).
  6. S. L. Brunton et al., “Discovering governing equations from data by sparse identification of nonlinear dynamical systems,” Proc. Natl. Acad. Sci. U.S.A. 113, 3932 (2016).
  7. J. M. Hanlan et al., “Cornerstones are the key stones: Using interpretable machine learning to probe the clogging process in 2D granular hoppers,” arXiv:2407.05491.

About the Author


Sam Dillavou is a postdoctoral researcher working with Douglas Durian and Andrea Liu at the University of Pennsylvania. He is an experimentalist interested in emergent behaviors of complex systems and ways in which machine learning (ML) can inform and support these endeavors. His projects include constructing analog electronic networks that learn as a consequence of physical dynamics and without any central control (“physical learning”) and probing driven, out of equilibrium systems like granular flows, where ML has been useful as an experimental guide. He completed his PhD in physics at Harvard University under the guidance of Shmuel Rubinstein, studying dynamics of static and sliding bodies and memory effects in disordered materials.


Experiment Management: How to Organize Your Model Development Process

Jakub Czakon

You will learn about what experiment management is and how organizing your model development process improves your workflow.

What is experiment management?

Experiment management in the context of machine learning is a process of tracking experiment metadata like:

  • code versions
  • data versions
  • hyperparameters
  • environment
  • metrics

Tracking ML experiments

Tracking your ML experiments lets you:

  • share your results and insights with the team (and your future self),
  • reproduce the results of machine learning experiments,
  • keep results that take a long time to generate safe.

Code version control for data science

Problem 1: Jupyter notebook version control

Tools that can help include:

  • nbconvert (.ipynb -> .py conversion)
  • nbdime (diffing)
  • jupytext (conversion + versioning)
  • neptune-notebooks (versioning + diffing + sharing)

Problem 2: Experiments on dirty commits

“But how about tracking code in-between commits? What if someone runs an experiment without committing the code?”
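
One pragmatic answer, sketched below under the assumption that the project lives in git, is to snapshot the exact commit plus a patch of any uncommitted changes next to each run (the file names are placeholders):

```python
import subprocess
from pathlib import Path

def snapshot_code_state(output_dir="experiment_artifacts"):
    """Save the current commit hash plus a patch of any uncommitted changes."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    commit = subprocess.run(["git", "rev-parse", "HEAD"],
                            capture_output=True, text=True, check=True).stdout.strip()
    diff = subprocess.run(["git", "diff", "HEAD"],
                          capture_output=True, text=True, check=True).stdout
    (out / "commit.txt").write_text(commit)
    (out / "uncommitted.patch").write_text(diff)
    return commit
```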

Tracking hyperparameters

Common approaches include config files, command line + argparse, a parameters dictionary in main.py, or magic numbers all over the place.

Data versioning

In almost every project, the data changes over time:

  • new images are added,
  • labels are improved,
  • mislabeled/wrong data is removed,
  • new data tables are discovered,
  • new features are engineered and processed,
  • validation and testing datasets change to reflect the production environment.

“Storage is cheap, training a model for 2 weeks on an 8-GPU node is not.”
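
A minimal way to version data without copying it, sketched below, is to log a content fingerprint of the dataset directory with every run (the path is a placeholder); dedicated data versioning tools do this more efficiently, but the idea is the same.

```python
import hashlib
from pathlib import Path

def dataset_version(data_dir="data/train"):
    """Return a short fingerprint of every file under the dataset directory."""
    digest = hashlib.sha256()
    for path in sorted(Path(data_dir).rglob("*")):
        if path.is_file():
            digest.update(str(path.relative_to(data_dir)).encode())
            digest.update(path.read_bytes())
    return digest.hexdigest()[:12]

# Log this identifier together with the run's parameters and metrics.
```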

Tracking machine learning metrics

When it comes to metrics, the rule of thumb is simple: “log metrics, log them all.”

Versioning the data science environment

“I don’t understand, it worked on my machine.”

  • Docker images
  • Conda environments


How to organize your model development process?

As the number of experiments grows, you need to figure out:

  • how to search through and visualize all of those experiments,
  • how to organize them into something that you and your colleagues can digest,
  • how to make this data shareable and accessible inside your team/organization.

A good experiment tracking setup should let you:

  • filter/sort/tag/group experiments,
  • visualize/compare experiment runs,
  • share (app and programmatic query API) experiment results and metadata.

Working in creative iterations

It usually makes sense to close the current iteration and re-plan when:

  • your first solution is good enough to satisfy business needs,
  • you can reasonably expect that there is no way to reach business goals within the previously assumed time and budget,
  • you discover that there is a low-hanging-fruit problem somewhere close and your team should focus their efforts there.

Model results exploration

Exploring model results thoroughly is worth the effort because:

  • it leads to business problem understanding,
  • it leads to focusing on the problems that matter and saves a lot of time and effort for the team and organization,
  • it leads to discovering new business insights and project ideas.

Useful resources on the topic:

  • “Understanding and diagnosing your machine-learning models” PyData talk by Gael Varoquaux
  • “Creating correct and capable classifiers” PyData talk by Ian Osvald
  • “Using the ‘What-If Tool’ to investigate Machine Learning models” article by Parul Pandey

Final thoughts

In this article, you learned:

  • what experiment management is,
  • how organizing your model development process improves your workflow.
