Machine Learning Operations 101
This week, we are going to demystify Machine Learning Operations: what it entails, why it is important, and what can be done to help a team that wants to improve its technical and cultural processes around AI.
There has been a lot of focus on generative AI recently, but whilst having a good Machine Learning Operations foundation will indeed allow you to harness its potential, MLOps will help any team that is looking to use any kind of machine learning process. This includes time-series forecasting, statistical NLP, and of course, the running, training, and maintenance of “foundation models.”
With the rise of generative AI, a lot of focus (and no small amount of hype) quickly went to the capabilities of Large Language Models (LLMs). And not without good reason. The generative capabilities of the GPT and Llama families of models have myriad use cases: from summarisation to information extraction in domains as varied as marketing, software design, engineering, and finance.
While it is clear that these tools cannot do everything and the initial hype has sensibly deflated, a growing number of stakeholders, from software developers to senior leadership, are now asking, “How can we best utilize this technology to differentiate ourselves?”
A quick Google search will bring up the topics of “MLOps,” “LLMOps,” or “Continuous Model Deployment.” Indeed, more job openings have become available for ML engineers and MLOps developers. Outside of these specialist roles, a request emerges, whether explicit or implicit, for software engineers and data scientists to learn some of these skills and take over those responsibilities. But beneath the surface, it becomes clear that the term means very different things to different people. And yet, the clamor to “implement MLOps” is growing.
So, let’s discuss the different facets of Machine Learning Operations, the skillsets and requirements for taking machine learning models from prototype through to production, and finally try to answer the question: “So what really is MLOps anyway, and why on Earth should I even care?”
Defining MLOps – differing and competing definitions
Becky Gorringe, host of the “Let’s Talk MLOps Podcast,” asked this very question recently.
- Is it a set of practices?
- Is it a culture?
- Is it a mindset?
- Is it about shipping science faster?
- Is it about getting value back from data science & machine learning?
By looking at the different ways of defining MLOps, we can understand what it offers, and why it might be important:
“Machine Learning Operations is an approach to managing the entire lifecycle of a machine learning model — including its training, tuning, everyday use in a production environment and retirement.”
“MLOps stands for Machine Learning Operations. It is a core function of machine learning engineering, focused on streamlining the process of taking machine learning models to production and then maintaining and monitoring them. MLOps is a collaborative function, often comprising data scientists, developers, engineers, and IT.”
However, other than agreeing that making machine learning work in production is important, these definitions don’t elaborate on what is required.
Understanding machine learning and the right skillset
Based on the definitions above, we can agree that MLOps requires technical knowledge of machine learning. Additionally, there is a reference to some sort of culture or collaboration. So, an MLOps engineer will likely care about how models are trained, how they operate, and how they can be made more efficient in real use.
But that is not enough. Maintaining a model requires an understanding of its biases, and of how the data passed into it at inference time differs from the data it was trained on. That’s why there’s a need for some data science skills too.
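One way to make the comparison between training data and inference-time data concrete is a drift score for a single numeric feature. The sketch below uses the Population Stability Index (PSI), which is one common heuristic among several and not something this article prescribes. The function name, the bin count, and the floor value are all illustrative assumptions.

```python
import math
from typing import Sequence


def population_stability_index(
    train: Sequence[float], live: Sequence[float], bins: int = 10
) -> float:
    """Compare two samples of one numeric feature; larger means more drift."""
    # Bin edges come from the training sample (an assumption of this sketch).
    lo, hi = min(train), max(train)
    width = (hi - lo) / bins or 1.0  # guard against a constant feature

    def bucket_shares(values: Sequence[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            # Clamp so live values outside the training range land in the edge bins.
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # A small floor avoids log(0) when a bin is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    expected, actual = bucket_shares(train), bucket_shares(live)
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))
```

A score of zero means the two samples fall into the bins in identical proportions. Any alerting threshold is a judgment call that should be tuned per feature and per model.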
Finally, it also requires some level of understanding of system design and operationalizing software. There are deployment decisions and questions to answer. For example:
- What happens if traffic to my model increases 20x overnight?
- What happens if someone sends unexpected input to my model?
- How do I know when something goes wrong?
- Should I continue scaling out services when demand increases? Or do I set a cap? If so, based on what?
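Two of the questions above, unexpected input and capped scaling, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a recommended design. The names `ScalingPolicy`, `target_replicas`, and `validate_text_input`, and every default value, are hypothetical. A real deployment would usually hand these decisions to its serving platform or orchestrator.

```python
import math
from dataclasses import dataclass


@dataclass
class ScalingPolicy:
    rps_per_replica: float  # load one replica can serve (assumed known from load testing)
    min_replicas: int = 1
    max_replicas: int = 10  # hard cap, for example derived from a budget


def target_replicas(current_rps: float, policy: ScalingPolicy) -> int:
    """Scale out with demand, but never beyond the configured cap."""
    needed = math.ceil(current_rps / policy.rps_per_replica)
    return min(max(needed, policy.min_replicas), policy.max_replicas)


def validate_text_input(payload: object, max_chars: int = 10_000) -> str:
    """Reject input the model was never meant to see, before it reaches the model."""
    if not isinstance(payload, str):
        raise ValueError("expected a string")
    text = payload.strip()
    if not text:
        raise ValueError("empty input")
    if len(text) > max_chars:
        raise ValueError("input too long")
    return text
```

With a policy of 50 requests per second per replica and a cap of 10, a jump from 50 to 1,000 requests per second asks for 20 replicas and receives 10. The shortfall is then a visible, deliberate trade-off between cost and availability that can be discussed with stakeholders.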
When you start to factor in questions about cost, and system design decisions that can affect the entire output of your technological stack, it becomes evident that you’re no longer making these choices in isolation. Therefore, a need arises to communicate the needs of your ML ecosystem articulately and carefully to stakeholders, bearing in mind that these stakeholders, along with their experience and priorities, can vary.
That’s why an effective MLOps function combines the skills of an infrastructure engineer, a software architect, and a data scientist, as well as a sufficient level of product knowledge to understand the “why” behind the implementation of the machine learning solution.
Bridging the gap between implementation and value
Even with the incorporation of MLOps tools budgeted for, we’re still a long way away from the finish line.
The focus then shifts towards optimizing and making the most of the tooling. Used well, the tooling puts machine learning at the center of your tech workflow and helps you answer questions such as:
- How do we best use the data we’ve gathered about a production model to improve it?
- Are there cheaper ways to maintain our current ML offering?
- How can we best adapt our services to suit how our customers are using them?
A good and multi-disciplined team will be able to use the tooling that’s been built to answer these questions. But it’s important to understand that while spending time on complex technical tools and monitoring is helpful, the real value comes when we share and use the information we learn. If the output cannot be communicated effectively, and the people using it are not in a position to use that information efficiently, all of the potential benefits can be lost.
Cultivating the right mindset and culture
To make the most of the technology available, the engineers, scientists and other stakeholders involved in the ecosystem have to be empowered and enabled to take point on data-driven decision-making. There also needs to be an open and communicative culture that is focused on continuous improvement. Novel ideas and alterations (whether technical or cultural) won’t be raised in an environment where people are not listened to.
Think of it like this: you’re not only iterating on your machine learning pipelines by using new data to make better predictions on some input; you also need to iterate on your ways of working, using new data to make better decisions.
Some would argue that this fits neatly into agile workflows. It does – but following agile is not a pre-requisite for an MLOps team. All you need is to commit to a culture of knowledge sharing, feedback, and mutual enrichment.
Creating an environment where knowledge-sharing is the standard has other benefits too. One of them is a better connection between the needs of your customers, the state of the product, and the people who are responsible for building and delivering it. This is where you can get further value from a data science function.
If you can understand the pain points of a customer’s experience, and see where innovation is needed within your platform, you can use data science to deliver and test proof-of-concept ideas. If you have a fast and continuously deploying ML stack, then this is a force multiplier for a data science team. It enables that team to quickly prototype and iterate on models to fit specific areas and use cases.
Empowering Machine Learning Operations
Going back to the question posed at the beginning, we now perhaps have an answer.
MLOps is to machine learning what DevOps is to development.
It’s about mentality and empowerment to put the responsibility for the management of ML systems into the hands of the people that create them. However, to feel empowered to iterate and improve an ML model, you need to have a culture that values measuring outputs and acting on the data.
To achieve this, a retraining pipeline or cycle is needed so that teams can “ship science faster.” This is where the deployment and infrastructural requirements come in. Engineers and data scientists need buy-in from management and leadership here to allocate resources in these domains.
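As a rough illustration of what the decision step of a retraining cycle might look like, the sketch below flags a model for retraining when it is too old or when its measured accuracy has slipped. The `ModelStatus` fields, the `should_retrain` name, and both thresholds are hypothetical assumptions. A real pipeline would pick metrics and limits that suit the model and the business.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ModelStatus:
    trained_at: datetime
    baseline_accuracy: float  # accuracy measured when the model was deployed
    recent_accuracy: float    # accuracy on recently labelled production data


def should_retrain(
    status: ModelStatus,
    now: datetime,
    max_age: timedelta = timedelta(days=30),
    max_accuracy_drop: float = 0.05,
) -> bool:
    """Return True when the model is stale or has measurably degraded."""
    too_old = now - status.trained_at > max_age
    degraded = status.baseline_accuracy - status.recent_accuracy > max_accuracy_drop
    return too_old or degraded
```

Keeping the decision in one small, testable function makes the retraining policy something the whole team can read, question, and change.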
Combining all of the above is what allows stakeholders to get value from data science and machine learning. This value isn’t just rooted in putting machine learning at the center of your technologies, but also in the ability to learn from how those models are used to improve them in a way that would otherwise be impossible.
This may sound straightforward when put into a paragraph, but it doesn’t mean it’s easy to execute. Building a communicative culture isn’t a one-and-done thing.
Just like a garden, it requires constant and careful tending (without over-meddling and interference) to grow the richest benefits. Similarly, a technological stack requires experience, bravery, patience, and the breaking down of silos to work effectively.
However, the companies that are successfully navigating this path and leveraging ML effectively are starting to gain momentum. Following good MLOps principles is one of the fastest ways to set yourself apart from the competition.
Recommendations and resources
If you want to learn more about MLOps, here are some recommended areas and communities to explore further:
📎 Newsletters:
Marvelous MLOps – by Maria Vechtomova, Başak Tuğçe Eskili and Raphaël Hoogvliets
Growing ML Platforms and Ops – by Mikiko Bazeley
🧑‍🤝‍🧑 Communities:
MLOps Community Slack Group
🎧 Podcasts:
Let’s Talk MLOps – by Becky Gorringe
The MLOps Podcast – by Dean Pleban