MLOps is a set of processes and automation for managing models, data, and code to improve
performance stability and long-term efficiency in ML systems. It combines ModelOps, DataOps, and
DevOps. MLOps is iterative: organisations should continuously refine each step as they learn from how
their models perform in production. Beyond processes and tooling, MLOps is also a culture that
embraces collaboration, continuous learning, and experimentation.


Define the ML Problem and Goals
• Clearly define the problem you want to solve using machine learning techniques.
• Set specific goals and metrics to measure the success of your ML model.
• Consider if a non-ML or simpler approach will produce a better result for the specific situation. ML is not always the best solution.
• Understand the business problem and how the ML solution will impact the business outcomes (cost of errors, model maintenance, and continuous monitoring).

Data Collection and Preprocessing

• Identify and collect relevant data for your ML project.
• Preprocess the data by cleaning, transforming, and normalizing it to make it suitable for training your models.
• Split the data into training, validation, and test sets.
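
The split in the last step can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name `split_dataset`, the 70/15/15 proportions, and the fixed seed are assumptions made for the example, not recommendations from this article.

```python
import random

def split_dataset(rows, train_frac=0.7, val_frac=0.15, seed=42):
    """Shuffle rows and split them into train, validation, and test lists."""
    if train_frac <= 0 or val_frac < 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must leave a non-empty share for the test set")
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # everything left over
    return train, val, test

train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))
```

Fixing the seed means the same rows land in the same split every time, which makes later experiments comparable. For time-ordered data a chronological split is usually more appropriate than a random shuffle.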

Model Development and Training
• Choose an appropriate machine learning algorithm or model architecture for your problem.
• Develop the initial version of your model using a training dataset.
• Iterate on your model by experimenting with different algorithms, hyperparameters, and features.
• Train your model using the training dataset and evaluate its performance on the validation set.
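
The experiment loop described above (train on the training set, compare candidates on the validation set) might look like the sketch below. It uses a deliberately tiny one-feature ridge regression so that it runs with no dependencies; the data, the candidate values, and the function names are all hypothetical.

```python
def fit_ridge_1d(xs, ys, l2):
    """Closed-form ridge fit for y = w * x (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + l2)

def mse(w, xs, ys):
    """Mean squared error of the model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def select_l2(train, val, candidates):
    """Fit one model per candidate and keep the lowest validation error."""
    results = []
    for l2 in candidates:
        w = fit_ridge_1d(train[0], train[1], l2)
        results.append((mse(w, val[0], val[1]), l2, w))
    return min(results)  # tuples compare on validation error first

# Made-up data, roughly y = 2x with a little noise.
train_set = ([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8])
val_set = ([5, 6], [10.1, 11.9])

val_error, best_l2, best_w = select_l2(train_set, val_set, [0.0, 1.0, 10.0])
print(f"best l2={best_l2}, weight={best_w:.3f}, validation MSE={val_error:.4f}")
```

The test set is not touched here on purpose: it is held back for a final, unbiased check once a candidate has been chosen.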

Model Deployment
• Once you have a trained and validated model, prepare it for deployment.
• Containerize your model using tools like Docker to ensure consistent and reproducible deployments.
• Create an API or service to expose your model’s predictions to other systems.
• Set up a versioning system to keep track of different model versions.
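
As a rough illustration of exposing predictions together with a model version, the sketch below shows a framework-agnostic request handler. The scoring rule, the version tag `MODEL_VERSION`, and the JSON field names are placeholders invented for the example; in practice this logic would sit behind a web framework and load a real trained model.

```python
import json

MODEL_VERSION = "1.0.0"  # hypothetical version tag for the deployed model

def predict(features):
    """Stand-in for a trained model: a fixed linear scoring rule."""
    weights = [0.5, -0.25]
    return sum(w * x for w, x in zip(weights, features))

def handle_request(body):
    """Parse a JSON request, run the model, and return a JSON response."""
    try:
        payload = json.loads(body)
        features = payload["features"]
        if not isinstance(features, list) or len(features) != 2:
            raise ValueError("expected a list of 2 numeric features")
        score = predict([float(x) for x in features])
    except (KeyError, TypeError, ValueError) as exc:
        # Malformed JSON, missing fields, and bad values all end up here.
        return json.dumps({"error": str(exc), "model_version": MODEL_VERSION})
    return json.dumps({"prediction": score, "model_version": MODEL_VERSION})

print(handle_request('{"features": [2, 4]}'))
print(handle_request('{"wrong_key": 1}'))
```

Returning the model version with every response makes it possible to trace any prediction back to the exact model that produced it.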


Monitoring and Logging
• Implement monitoring and logging mechanisms to track the performance and behaviour of your deployed model.
• Monitor data drift to detect changes in the data distribution that could degrade model accuracy.
• Set up logging to record predictions, errors, and other relevant information for troubleshooting and auditing purposes.
• Integrate tools for model understanding and fairness into the MLOps cycle.
• Have a method for rolling back to a previous stable version if a model fails or causes unexpected issues.
• Put a disaster recovery plan in place to handle pipeline errors, outages, or failures.
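
One common way to quantify data drift is the Population Stability Index (PSI), which compares the binned distribution of a feature in production against a reference sample. The sketch below is one possible implementation, not a prescribed method; the bin count, the `1e-6` floor for empty bins, and the 0.2 alert threshold (a widely used rule of thumb) are assumptions.

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between a reference and a live sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are equal

    def shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

reference = [i / 100 for i in range(1000)]   # feature values seen at training time
live = [v + 5 for v in reference]            # same feature in production, shifted

score = psi(reference, live)
if score > 0.2:  # assumed alert threshold
    print(f"drift alert: PSI={score:.2f}")
```

A PSI near zero means the two samples fill the bins in similar proportions; larger values indicate that the live data has moved away from what the model was trained on.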



Continuous Integration and Continuous Deployment (CI/CD)
• Implement CI/CD pipelines to automate the process of building, testing, and deploying your ML models.
• Set up a version control system (e.g., Git) to manage your codebase and collaborate with other team members.
• Enable developers and data scientists to track experiments and to record and compare parameters and results.
• Integrate automated testing to ensure the quality and reliability of your ML models.
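
Automated testing for ML often includes a quality gate: a check in the pipeline that blocks deployment when a candidate model is not good enough. A minimal sketch is shown below; the metric name, the thresholds, and the function `quality_gate` are illustrative assumptions rather than part of any specific CI tool.

```python
def quality_gate(candidate, baseline, min_accuracy=0.80, max_regression=0.01):
    """Return the reasons to block deployment; an empty list means pass."""
    failures = []
    acc = candidate["accuracy"]
    if acc < min_accuracy:
        failures.append(f"accuracy {acc:.3f} is below the floor {min_accuracy:.3f}")
    if acc < baseline["accuracy"] - max_regression:
        failures.append(
            f"accuracy {acc:.3f} regressed against baseline {baseline['accuracy']:.3f}"
        )
    return failures

# Hypothetical metrics for the current production model and a new candidate.
production = {"accuracy": 0.90}
candidate = {"accuracy": 0.85}

problems = quality_gate(candidate, production)
for reason in problems:
    print("BLOCKED:", reason)
```

In a CI/CD pipeline, a non-empty result would typically fail the build (for example by raising an exception or exiting with a non-zero status) so the candidate never reaches production.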



Model Monitoring and Retraining
• Continuously monitor the performance of your deployed model in production.
• Collect feedback and gather additional data to improve your model.
• Incorporate real-time user feedback and changes in business requirements or the operating environment; this is a critical step in MLOps.
• Use feedback loops so that the model can be iterated on as the business environment changes.
• Periodically retrain your model using updated data to maintain its accuracy and effectiveness.
• Consider adding a system for subject matter expert (SME) review of edge cases or complex situations.
• Manage the model lifecycle so that models that are outdated or no longer provide value are retired.
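
A feedback loop can be as simple as comparing predictions with the outcomes that arrive later and flagging the model for retraining when recent accuracy drops. The sketch below assumes a classification model with delayed ground-truth labels; the class name `FeedbackMonitor`, the window size, and the accuracy threshold are hypothetical.

```python
from collections import deque

class FeedbackMonitor:
    """Tracks accuracy over the most recent predictions with known outcomes."""

    def __init__(self, window=100, min_accuracy=0.85):
        self.outcomes = deque(maxlen=window)  # oldest entries drop off automatically
        self.min_accuracy = min_accuracy

    def record(self, predicted, actual):
        self.outcomes.append(predicted == actual)

    def rolling_accuracy(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_retraining(self):
        # Wait for a full window so a handful of early errors does not trigger retraining.
        if len(self.outcomes) < self.outcomes.maxlen:
            return False
        return self.rolling_accuracy() < self.min_accuracy

monitor = FeedbackMonitor(window=4, min_accuracy=0.75)
for predicted, actual in [("a", "a"), ("b", "b"), ("a", "a"), ("a", "b"), ("b", "a")]:
    monitor.record(predicted, actual)
print(monitor.rolling_accuracy(), monitor.needs_retraining())
```

The same signal could instead route uncertain or frequently wrong cases to an SME for review before any retraining is started.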


Infrastructure Management
• Establish infrastructure management practices to handle the deployment, scaling, and maintenance of your ML infrastructure.
• Environment separation: Separate different stages of ML code into different environments with clearly defined transitions between stages. For example, split the environment into development (DEV), staging (TEST), and production (PROD).
• Use tools like Docker and Kubernetes for containerisation and orchestration, and infrastructure-as-code tools such as Terraform to provision resources.
• Monitor and optimize costs so that storage is managed efficiently and the right compute resources are selected.
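
Environment separation is often expressed as one explicit configuration per stage plus a fixed promotion order. The following sketch shows the idea; every value in it (paths, replica counts, approval flags) is a made-up placeholder.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EnvConfig:
    name: str
    data_path: str           # hypothetical storage location for this stage
    replicas: int            # number of serving instances
    requires_approval: bool  # whether a human must sign off before deploying

ENVIRONMENTS = {
    "dev": EnvConfig("dev", "/data/dev", replicas=1, requires_approval=False),
    "test": EnvConfig("test", "/data/test", replicas=1, requires_approval=False),
    "prod": EnvConfig("prod", "/data/prod", replicas=3, requires_approval=True),
}

PROMOTION_ORDER = ["dev", "test", "prod"]

def next_environment(current):
    """Return the stage a model is promoted to next, or None after production."""
    idx = PROMOTION_ORDER.index(current)
    if idx + 1 < len(PROMOTION_ORDER):
        return PROMOTION_ORDER[idx + 1]
    return None

print(next_environment("dev"), ENVIRONMENTS["prod"].requires_approval)
```

Keeping the configuration frozen and the promotion order explicit means a model can only move forward one stage at a time, and each stage's settings cannot be changed accidentally at runtime.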


Data Governance and Quality Assurance
• Put appropriate data governance practices in place to give the models a stable foundation.
• Data quality assurance checks should be conducted at all stages.
• Data formats and data management practices should be standardized.
• These processes reduce the likelihood of data issues affecting model performance and reliability.
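
A data quality check can be a small function that validates each incoming batch against an agreed schema before it reaches training or inference. The sketch below is one way to do this with plain Python; the column names, types, and ranges in `SCHEMA` are invented for the example.

```python
# Hypothetical schema: column name -> (expected type, minimum, maximum).
SCHEMA = {
    "age": (int, 0, 120),
    "income": (float, 0.0, 1e7),
}

def validate_rows(rows, schema):
    """Return (row_index, problem) tuples; an empty list means the batch is clean."""
    problems = []
    for i, row in enumerate(rows):
        for column, (expected_type, lo, hi) in schema.items():
            value = row.get(column)
            if value is None:
                problems.append((i, f"{column}: missing"))
            elif not isinstance(value, expected_type):
                problems.append((i, f"{column}: expected {expected_type.__name__}"))
            elif not lo <= value <= hi:
                problems.append((i, f"{column}: {value} outside [{lo}, {hi}]"))
    return problems

batch = [
    {"age": 34, "income": 52000.0},
    {"age": -1, "income": 1000.0},   # out-of-range age
    {"age": 40},                     # missing income
]
for index, problem in validate_rows(batch, SCHEMA):
    print(f"row {index}: {problem}")
```

Running the same checks at ingestion, before training, and before serving is one way to apply quality assurance at all stages with a single shared definition of valid data.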



Collaboration and Documentation
• Document your ML experiments, processes, and decisions to facilitate knowledge sharing and reproducibility.
• Regularly share findings to create a Centre of Excellence (COE).
• Foster collaboration within your team by using tools like Databricks or Jupyter Notebooks, Git, and project management platforms.


Security and Privacy
• Ensure that proper security measures are in place to protect sensitive data used in your ML models.
• Implement privacy safeguards to comply with applicable regulations, such as the Privacy Act, or GDPR and HIPAA internationally.


We currently consider it best practice to use open-source tools as much as possible across widely popular platforms such as Azure, AWS and GCP. Databricks is a current front runner because it is a unified data analytics platform built on Apache Spark that provides a collaborative environment for data engineering, data science, and machine learning. There are of course many other options too – below we have suggested some options.

Potential High-level Architectures
• With Databricks on Azure: Azure Storage Account (Data source) + Azure DevOps + Azure Databricks

• Without Databricks on Azure: Azure Storage Account (Data source) + Git repository (the place to hold code) + Azure DevOps + Azure Machine Learning

• Amazon Web Services (AWS): AWS offers various services for ML, such as Amazon SageMaker for training and deploying models, AWS Lambda for serverless computing, and AWS Batch for batch processing. Databricks is also available on AWS.

• Google Cloud Platform (GCP): GCP offers tools like Google Cloud AI Platform for ML model training and deployment, Google Kubernetes Engine (GKE) for container orchestration, and Cloud Functions for serverless computing. Databricks is also available on GCP.

• Multi-cloud strategies can leverage the benefits of different cloud providers, when appropriate.

There are also several other popular tools available for implementing MLOps. These tools provide various functionalities and integrations to streamline the ML lifecycle and implement best practices in MLOps. The choice of tools may depend on your specific requirements, infrastructure, and preferences. Here are some suggestions from ChatGPT:

Version Control System
• Git: Git is a widely adopted version control system for tracking changes in code, data, and model versions. It allows for collaboration, code review, and easy branching and merging.



Containerization and Orchestration
• Docker: Docker is a popular containerization platform that allows you to package your ML models and dependencies into portable containers. It ensures consistency across different environments.

• Kubernetes: Kubernetes is an orchestration tool that helps manage and scale containerized applications. It provides features like automatic scaling, load balancing, and self-healing.

Continuous Integration and Continuous Deployment (CI/CD)
• Jenkins: Jenkins is an open-source automation server that enables CI/CD pipelines. It allows you to automate building, testing, and deploying ML models.

• GitLab CI/CD: GitLab provides a built-in CI/CD platform that integrates with Git repositories. It supports continuous integration, automated testing, and deployment pipelines.


Monitoring and Observability

• Prometheus: Prometheus is an open-source monitoring and alerting toolkit. It can be used to collect and store metrics from your ML models and infrastructure.
• Grafana: Grafana is a visualization tool that integrates with Prometheus and other data sources. It allows you to create customizable dashboards for monitoring and observability.



Experimentation and Model Tracking
• MLflow: MLflow is an open-source platform for managing the ML lifecycle. It provides tools for tracking experiments, managing models, and reproducing results.
• TensorBoard: TensorBoard is a visualization toolkit provided by TensorFlow. It helps with visualising training metrics, inspecting model graphs, and profiling.



Automation and Infrastructure as Code
• Terraform: Terraform is an infrastructure-as-code tool that enables you to define and manage your ML infrastructure declaratively. It supports multiple cloud providers and can provision resources consistently.



Collaboration and Documentation

• Databricks or Jupyter Notebooks: Notebooks provide an interactive environment for developing and documenting ML models. They enable code execution, visualizations, and narrative text.
• Confluence: Confluence is a popular team collaboration and documentation platform. It can be used to share knowledge, document processes, and collaborate on ML projects.
